Table of Contents
- 1. Overview
- 2. The SwiftScript Language
- 3. Mappers
- 4. Swift configuration properties
- 5. The swift command
- 6. Kickstart
- 7. Workflow restart/recovery
- 8. Invoking an application from Swift
- 9. Technical overview of the Swift architecture
- 10. Ways in which Swift can be extended
- 11. Function reference
- 12. Built-in procedure reference
- 13. Profiles
- 14. Clustering
- 15. How-To Tips for Specific User Communities
- 16. The Site Catalog - sites.xml
- 17. The Transformation Catalog - tc.data
- 18. Environment variables
- 19. Build options
This manual provides reference material for Swift: the SwiftScript language and the Swift runtime system. For introductory material, consult the Swift tutorial.
Swift is a data-oriented, coarse-grained scripting language that supports dataset typing and mapping, dataset iteration, conditional branching, and sub-workflow composition.
Swift programs (or workflows) are written in a language called SwiftScript.
SwiftScript programs are dataflow oriented - they are primarily concerned with processing (possibly large) data files, by invoking programs to do that processing. Swift handles execution of such programs on remote sites by choosing sites, handling the staging of input and output files to and from the chosen sites and remote execution of program code.
Data processed by Swift is strongly typed. It may take the form of values in memory or of out-of-core files on disk. Language constructs called mappers specify how each piece of data is stored.
Data is represented in Swift by DSHandles (Dataset handles).
In Swift, a DSHandle can represent data in one of three forms:
an in-memory value, such as a string or integer.
a data file (on local disk or stored elsewhere on the internet). When a DSHandle represents an on-disk data file, the name of that file is provided by a mapper.
a container of other DSHandles - either an array or a defined type. Such a DSHandle contains subordinate DSHandles; these may be nested to arbitrary depth.
The above three are mutually exclusive - for example, a data set that is mapped to a file cannot have a value that can be used in a SwiftScript expression; and a data item that is a value cannot be passed into an application executable as a data file using the @filename function.
When a DSHandle represents a data file (or container of datafiles), it is associated with a mapper. The mapper is used to identify which files belong to that DSHandle.
A dataset's physical representation is declared by a mapping descriptor, which defines how each element in the dataset's logical schema is stored in, and fetched from, physical structures such as directories, files, and remote servers.
Mappers are parameterized to take into account properties such as varying dataset location. In order to access a dataset, we need to know three things: its type, its mapping, and the value(s) of any parameter(s) associated with the mapping descriptor. For example, if we want to describe a dataset, of type imagefile, and whose physical representation is a file called "file1.bin" located at "/home/yongzh/data/", then the dataset might be declared as follows:
imagefile f1<single_file_mapper;file="/home/yongzh/data/file1.bin">
The above example declares a dataset called f1, which uses a single file mapper to map a file from a specific location.
SwiftScript has a simplified syntax for this case, since single_file_mapper is frequently used:
imagefile f1<"/home/yongzh/data/file1.bin">
Swift comes with a number of mappers that handle common mapping patterns. These are documented in the mappers section of this guide.
Procedures in Swift are executed in parallel by default. If five separate procedures are invoked, Swift will attempt to run them all at once.
The main exception to this is when one procedure produces a dataset as an output and another procedure uses that dataset as an input. In that case, the second procedure will be executed after the first procedure has produced the intermediate dataset.
The SwiftScript type system consists of a number of simple types, marker types for files, arrays, and complex types composed of these types.
The simple types are: string, float, int and boolean.
Complex types are specified using the type keyword. The syntax is similar to struct in C or class in Java. The example below declares a complex type with two members: a string called name and an integer called age.
type person {
    string name;
    int age;
}
When referring to files on disk, the internal structure of the file is irrelevant; but it is still useful to declare types for those files so that Swift can perform type-checking on SwiftScript programs. In this case, a marker type can be declared, like this:
type binaryfile;
Variables in SwiftScript are declared to be of a specific type. Assignments to those variables must be data of that type. SwiftScript variables are single-assignment - a value may be assigned to a variable at most once. This assignment can happen at declaration time or later on in execution. When code attempts to read a variable that has not yet been assigned, that code is suspended until the variable has been written to. This forms the basis for Swift's ability to parallelise execution - all code will execute in parallel unless variables shared between pieces of code force an ordering.
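The suspend-until-assigned behaviour can be modelled outside Swift. The following Python sketch is an analogy only, not Swift's implementation: a concurrent.futures.Future stands in for a single-assignment variable, and the names c, reader and results are hypothetical.

```python
import threading
from concurrent.futures import Future

# A Future models a single-assignment variable: readers block until it is set.
c = Future()
results = []

def reader():
    # This read is "suspended" until c has been assigned a value.
    results.append(c.result() + 1)

t = threading.Thread(target=reader)
t.start()          # the reader starts before c has a value
c.set_result(41)   # the assignment releases the suspended reader
t.join()

print(results)  # [42]
```

The reader is started first, yet it always sees the assigned value, because the data dependency on c determines the ordering rather than the order in which the statements were written.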
Variable declaration statements declare new variables. They can optionally assign a value to them or map those variables to on-disk files.
Declaration statements have the general form:
typename variablename (<mapping> | = initialValue ) ;
The format of the mapping expression is defined in the Mappers section. initialValue may be either an expression or a procedure call that returns a single value.
Variables can also be declared in a multivalued-procedure statement, described in another section.
Assignment statements assign values to previously declared variables. Assignments may only be made to variables that have not already been assigned. Assignment statements have the general form:
variable = value;
where value can be either an expression or a procedure call that returns a single value.
Variables can also be assigned in a multivalued-procedure statement, described in another section.
Datasets are operated on by procedures, which take input in the form of mapped variables, perform computations, and produce typed data as output that is again mapped to variables.
There are two kinds of procedure: atomic procedures, which describe how an external program can be executed; and compound procedures, which consist of a sequence of SwiftScript statements.
A procedure declaration defines the name of a procedure and its input and output parameters. SwiftScript procedures can take multiple inputs and produce multiple outputs. Inputs are specified to the right of the procedure name and outputs to the left. For instance:
(type3 out1, type4 out2) myproc (type1 in1, type2 in2)
The above example declares a procedure called myproc, which has two inputs in1 (of type type1) and in2 (of type type2) and two outputs out1 (of type type3) and out2 (of type type4).
A procedure input parameter can be optional, in which case it must be declared with a default value. Procedure calls accept both positional and named (keyword) parameters, with two restrictions: all optional parameters must be declared after the required parameters, and any optional parameter must be bound by name. For instance, if we declare a procedure myproc1:
(binaryfile bf) myproc1 (int i, string s="foo")
Then the procedure can be called like this:
binaryfile mybf = myproc1(1);
or like this, supplying a value for the optional parameter s:
binaryfile mybf = myproc1 (1, s="bar");
The body of an atomic procedure specifies how to invoke an external executable program or Web Service, and how logical data types are mapped to command line arguments. A complete specification for myproc1 can be:
(binaryfile bf) myproc1 (int i, string s="foo") {
    app {
        myapp1 i s @filename(bf);
    }
}
which specifies that myproc1 invokes an executable called myapp1, passing the values of i, s and the file name of bf as command line arguments. The @filename notation is a function denoting that the argument should be passed as a file name. Because this notation is needed so often when invoking applications, a shorter syntax is available: omit the filename part and use the @ sign alone.
A compound procedure contains a set of calls to other procedures. Shared variables in the body of a compound procedure specify data dependencies and thus the execution sequence of the procedure calls. As a simple illustration, a compound procedure is defined below:
(type2 b) foo_bar (type1 a) {
    type3 c;
    c = foo(a); // c holds the result of foo
    b = bar(c); // c is an input to bar
}
The syntax of SwiftScript has a superficial resemblance to C and Java. For example, { and } characters are used to enclose blocks of statements.
A SwiftScript program consists of a number of statements. Statements may declare types, procedures and variables, assign values to variables, and express operations over arrays.
Procedures can return more than one value. In such cases, the previously mentioned declaration and assignment statements are insufficient, and a multi-valued procedure invocation can be used instead. This has the general form:
'(' ((type)? variableName ( '=' binding )?)+ ')' = procedureinvocation
Variables can be either declared (if a type is included) or assigned (if a type is not included). If no bindings are specified, then variables are assigned in the same order that they are specified in the procedure declaration. If bindings are specified, then variables are assigned to the named return parameter.
SwiftScript provides if, switch, foreach, and while constructs, with syntax and semantics similar to comparable constructs in other high-level languages.
The foreach construct is used to apply a block of statements to each element in an array. For example:
check_order (file a[]) {
    foreach f in a {
        compute(f);
    }
}
foreach statements have the general form:
foreach controlvariable (,index) in expression {
    statements
}
The block of statements is evaluated once for each element in 'expression', with controlvariable set to the corresponding element and index set to the integer position in the array that is being iterated over.
The 'if' statement allows one of two blocks of statements to be executed, based on a boolean predicate. 'if' statements generally have the form:
if(predicate) {
    statements
} else {
    statements
}
where predicate is a boolean expression.
Switch expressions allow one of a selection of blocks to be chosen based on the value of a numerical control expression. Switch statements take the general form:
switch(controlExpression) {
    case n1:
        statements1
    case n2:
        statements2
    [...]
    default:
        statements
}
The control expression is evaluated and the resulting numerical value used to select a corresponding case, and the statements belonging to that case are evaluated. If no case corresponds, then the statements belonging to the default block are evaluated.
Unlike C or Java switch statements, execution does not fall through to subsequent case blocks, and no break statement is necessary at the end of each block.
Iterate expressions allow a block of code to be evaluated repeatedly, with an integer parameter sweeping upwards from 0 until a termination condition holds.
The general form is:
iterate var {
    statements;
} until (terminationExpression);
with the variable var starting at 0 and increasing each iteration. That variable is in scope in the statements block and when evaluating the termination expression.
The following infix operators are available for use in SwiftScript expressions.
+ numeric addition; string concatenation
- numeric subtraction
* numeric multiplication
/ floating point division
%/ integer division
%% integer remainder-of-division
== != equality and inequality comparison
< > <= >= ordering comparison
&& || boolean and, or
! boolean not
Mappers provide a mechanism to specify the layout of mapped datasets on disk. This is needed when Swift must access files to transfer them to remote sites for execution or to pass to applications.
Swift provides a number of mappers that are useful in common cases. This section details those standard mappers. For more complex cases, it is possible to write application-specific mappers in Java and use them within a SwiftScript program. For more information on writing an application specific mapper, see the tutorial module on mappers.
- Name: single_file_mapper
- Description: A single file mapper maps a single physical file to a dataset.
Swift variable -------------------> Filename
f                                   myfile
f[0]                                INVALID
f.bar                               INVALID
- Parameter:
- file: The location of the physical file including path and file name.
- Example:
file f <single_file_mapper;file="plot_outfile_param">;
There is a simplified syntax for this mapper:
file f <"plot_outfile_param">;
- Name: simple_mapper
- Description: A mapper that maps a file or a list of files into an array by prefix, suffix, and pattern. If more than one file is matched, each of the file names will be mapped as a subelement of the dataset.
- Parameters:
- location: The directory where the files are located.
- prefix: The prefix of the files
- suffix: The suffix of the files, for instance: ".txt"
- pattern: A UNIX glob style pattern, for instance: "*foo*" would match all file names that contain foo. When this mapper is used to specify output filenames, pattern is ignored.
- Examples:
type file;
file f <simple_mapper;prefix="foo", suffix=".txt">;
The above maps all filenames that start with foo and have the extension .txt into file f.
Swift variable -------------------> Filename
f                                   foo.txt
type messagefile;
(messagefile t) greeting(string m) {
    app {
        echo m stdout=@filename(t);
    }
}
messagefile outfile <simple_mapper;prefix="foo",suffix=".txt">;
outfile = greeting("hi");
This will output the string 'hi' to the file foo.txt.
- The simple_mapper can be used to map arrays. It will map the array index into the filename between the prefix and suffix.
type messagefile;
(messagefile t) greeting(string m) {
    app {
        echo m stdout=@filename(t);
    }
}
messagefile outfile[] <simple_mapper;prefix="baz",suffix=".txt">;
outfile[0] = greeting("hello");
outfile[1] = greeting("middle");
outfile[2] = greeting("goodbye");
Swift variable -------------------> Filename
outfile[0]                          baz0000.txt
outfile[1]                          baz0001.txt
outfile[2]                          baz0002.txt
- The simple_mapper can be used to map structures. It will map the name of the structure member into the filename, between the prefix and the suffix.
type messagefile;
type mystruct {
    messagefile left;
    messagefile right;
};
(messagefile t) greeting(string m) {
    app {
        echo m stdout=@filename(t);
    }
}
mystruct out <simple_mapper;prefix="qux",suffix=".txt">;
out.left = greeting("hello");
out.right = greeting("goodbye");
This will output the string "hello" into the file quxleft.txt and the string "goodbye" into the file quxright.txt.
Swift variable -------------------> Filename
out.left                            quxleft.txt
out.right                           quxright.txt
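The naming pattern in the simple_mapper mapping tables above can be summarised in a few lines. This Python sketch only reproduces the filenames listed in those tables; it is not Swift's mapper code. The function name simple_mapper_name is hypothetical, and the four-digit zero padding of array indices is an assumption read off the baz0000.txt entries.

```python
def simple_mapper_name(prefix, suffix, index=None, member=None):
    """Build a filename following the simple_mapper tables above."""
    if index is not None:
        middle = f"{index:04d}"   # assumption: four-digit, zero-padded index
    elif member is not None:
        middle = member           # structure member name
    else:
        middle = ""               # plain (non-array, non-structure) variable
    return f"{prefix}{middle}{suffix}"

print(simple_mapper_name("foo", ".txt"))                 # foo.txt
print(simple_mapper_name("baz", ".txt", index=1))        # baz0001.txt
print(simple_mapper_name("qux", ".txt", member="left"))  # quxleft.txt
```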
- Name: concurrent_mapper
- Description: Concurrent mapper is almost the same as the simple mapper, except that it is used to map an output file, and the generated filename will contain an extra sequence that is unique. This mapper is the default mapper for variables when no mapper is specified.
- Parameters:
- location: The directory where the files are located.
- prefix: The prefix of the files
- suffix: The suffix of the files, for instance: ".txt"
- pattern: A UNIX glob style pattern, for instance: "*foo*" would match all file names that contain foo. When this mapper is used to specify output filenames, pattern is ignored.
- Example:
file f1;
file f2 <concurrent_mapper;prefix="foo", suffix=".txt">;
The above example would use the concurrent mapper for f1 and f2, and generate the f2 filename with prefix "foo" and extension ".txt".
TODO: note on difference between location as a relative vs absolute path wrt staging to remote location - as mihael said: It's because you specify that location in the mapper. Try location="." instead of location="/sandbox/..."
- Name: filesys_mapper
- Description: This mapper is similar to the simple mapper, but maps a file or a list of files to an array. Each filename is mapped as an element in the array. The order of files in the resulting array is not defined.
- Parameters:
- location: The directory where the files are located.
- prefix: The prefix of the files
- suffix: The suffix of the files, for instance: ".txt"
- pattern: A UNIX glob style pattern, for instance: "*foo*" would match all file names that contain foo.
- Example:
file texts[] <filesys_mapper;prefix="foo", suffix=".txt">;
The above example would map all filenames that start with "foo" and have an extension ".txt" into the array texts. For example, if the specified directory contains files: foo1.txt, footest.txt, foo__1.txt, then the mapping might be:
Swift variable -------------------> Filename
texts[0]                            footest.txt
texts[1]                            foo1.txt
texts[2]                            foo__1.txt
- Name: fixed_array_mapper
- Description: This mapper maps from a string that contains a list of filenames into a file array.
- Parameter:
- files: A string that contains a list of filenames, separated by space, comma or colon
- Example:
file texts[] <fixed_array_mapper;files="file1.txt, fileB.txt, file3.txt">;
would cause a mapping like this:
Swift variable -------------------> Filename
texts[0]                            file1.txt
texts[1]                            fileB.txt
texts[2]                            file3.txt
- Name: array_mapper
- Description: This mapper transforms an array of strings into a file array.
- Parameter:
- files: An array of strings containing one filename per element
- Example:
string s[] = [ "a.txt", "b.txt", "c.txt" ];
file f[] <array_mapper;files=s>;
This will establish the mapping:
Swift variable -------------------> Filename
f[0]                                a.txt
f[1]                                b.txt
f[2]                                c.txt
- Name: regexp_mapper
- Description: This mapper transforms one file name to another using regular expression matching.
- Parameters:
- source: The source file name
- match: Regular expression pattern to match, use ( ) to match whatever regular expression is inside the parentheses, and indicate the start and end of a group; the contents of a group can be retrieved with the \number special sequence
- transform: The pattern of the file name to transform to, use \number to reference the group matched.
- Example:
string s = "picture.gif";
file f <regexp_mapper;source=s,match="(.*)gif",transform="\1jpg">;
This example transforms a string ending in "gif" into one ending in "jpg" and maps that to a file.
Swift variable -------------------> Filename
f                                   picture.jpg
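The match and transform parameters behave like an ordinary regular expression substitution with group references. The sketch below replays the example in Python; Python's re module is used here only as a stand-in, and Swift's regular expression dialect may differ in details.

```python
import re

source = "picture.gif"
match = r"(.*)gif"      # group 1 captures everything before "gif"
transform = r"\1jpg"    # reuse group 1 and append "jpg"

mapped = re.sub(match, transform, source)
print(mapped)  # picture.jpg
```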
- Name: csv_mapper
- Description: This mapper maps the content of a CSV (comma-separated value) file into an array of structures. The dataset type needs to be correctly defined to conform to the column names in the file. For instance, if the file contains columns:
name age GPA
then the type needs to have the same member elements, say:
type student {
    File name;
    File age;
    File GPA;
}
If the file does not contain a header with column info, then the column names are assumed to be "column1", "column2", etc.
- Parameters:
- file: The name of the CSV file to read mappings from.
- header: Whether the file has a line describing header info; default is true
- skip: The number of lines to skip at the beginning (after header line); default is 0.
- hdelim: Header field delimiter; default is the value of the "delim" parameter
- delim: Content field delimiters; defaults are space, tab and comma.
- Example:
student stus[] <csv_mapper;file="stu_list.txt">;
The above example would read a list of student info from file "stu_list.txt" and map them into a student array. By default, the file should contain a header line specifying the names of the columns. If stu_list.txt contains the following:
name,age,gpa
101-name.txt, 101-age.txt, 101-gpa.txt
name55.txt, age55.txt, gpa55.txt
q, r, s
then some of the mappings produced by this example would be:
Swift variable -------------------> Filename
stus[0].name                        101-name.txt
stus[0].age                         101-age.txt
stus[0].gpa                         101-gpa.txt
stus[1].name                        name55.txt
stus[1].age                         age55.txt
stus[1].gpa                         gpa55.txt
stus[2].name                        q
stus[2].age                         r
stus[2].gpa                         s
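How header names and row positions combine into variable paths can be sketched as follows. This is a Python illustration of the mapping table above, not the csv_mapper implementation. The function name csv_mappings is hypothetical, the rows are those shown in the mapping table, and only the comma delimiter is handled.

```python
import csv
import io

# Rows as listed in the mapping table above, preceded by the header line.
text = """name,age,gpa
101-name.txt, 101-age.txt, 101-gpa.txt
name55.txt, age55.txt, gpa55.txt
q, r, s
"""

def csv_mappings(text, var="stus"):
    """Map each cell to a path of the form var[row].column."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)              # first line names the structure members
    mappings = {}
    for row_index, row in enumerate(reader):
        for column, filename in zip(header, row):
            mappings[f"{var}[{row_index}].{column}"] = filename
    return mappings

m = csv_mappings(text)
print(m["stus[0].name"])  # 101-name.txt
print(m["stus[1].gpa"])   # gpa55.txt
```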
- Name: ext
- Description: This mapper maps based on the output of a supplied Unix executable.
- Parameters:
- exec: The name of the executable (relative to the current directory, if an absolute path is not specified)
- Other parameters are passed to the executable prefixed by a - symbol.
- The output of the executable should consist of two columns of data, separated by a space. The first column should be the path of the mapped variable, in SwiftScript syntax (for example, [2] means the array element with index 2), or the symbol $ to represent the root of the mapped variable. The second column should be the name of the file to map to that path, as in the example below.
- Example:
With the following in mapper.sh,
#!/bin/bash
echo "[2] qux"
echo "[0] foo"
echo "[1] bar"
then a mapping statement:
student stus[] <ext;exec="mapper.sh">;
would map
Swift variable -------------------> Filename
stus[0]                             foo
stus[1]                             bar
stus[2]                             qux
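The two-column output format can be interpreted as shown in the following Python sketch, which is an illustration under stated assumptions rather than Swift's own parser. The function name parse_ext_output is hypothetical, and the sketch assumes that a path is simply appended to the variable name and that $ denotes the variable itself.

```python
def parse_ext_output(output, var="stus"):
    """Turn 'path filename' lines into a {variable path: filename} mapping."""
    mappings = {}
    for line in output.splitlines():
        if not line.strip():
            continue                          # ignore blank lines
        path, filename = line.split(" ", 1)   # two space-separated columns
        name = var if path == "$" else var + path
        mappings[name] = filename
    return mappings

# Output of the mapper.sh script from the example above.
output = "[2] qux\n[0] foo\n[1] bar\n"
m = parse_ext_output(output)
for name in sorted(m):
    print(name, m[name])
```

The order of the lines printed by the executable does not matter, since each line names the element it maps.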
Various aspects of the behavior of the Swift Engine can be
configured through properties. The Swift Engine recognizes a global,
per-installation properties file, which can be found in $SWIFT_HOME/etc/swift.properties, and a user
properties file, which can be created by each user in ~/.swift/swift.properties. The Swift Engine
will first load the global properties file. It will then try to load
the user properties file. If a user properties file is found,
individual properties explicitly set in that file will override the
respective properties in the global properties file. Furthermore,
some of the properties can be overridden directly using command line
arguments to the swift command.
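The override order described above (global file, then user file, then command line) amounts to a key-by-key merge in which later sources win. A minimal Python sketch, with hypothetical property values chosen purely for illustration:

```python
# Hypothetical contents of the three property sources.
global_props = {"sites.file": "/opt/swift/etc/sites.xml", "lazy.errors": "false"}
user_props = {"lazy.errors": "true"}
command_line = {"sites.file": "./my-sites.xml"}

# Later sources override earlier ones, one property at a time.
effective = {**global_props, **user_props, **command_line}
print(effective)
```

Properties that are set only in the global file keep their global values; nothing is discarded wholesale when a user file exists.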
Swift properties are specified in the following format:
<name>=<value>
The value can contain variables which will be expanded when the
properties file is read. Expansion is performed when the name of
the variable is used inside the "standard" shell dereference
construct: ${name}. The following variables
can be used in the Swift configuration file:
Swift Configuration Variables
- swift.home
Points to the Swift installation directory ($SWIFT_HOME).
- user.name
The name of the currently logged-in user.
- user.home
The user's home directory.
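The ${name} expansion can be pictured as a simple textual substitution. The Python sketch below is an illustration, not Swift's property loader; the function name expand and the variable values are hypothetical.

```python
import re

def expand(value, variables):
    """Replace each ${name} in value with variables[name]."""
    # An unknown name raises KeyError in this sketch.
    return re.sub(r"\$\{([^}]+)\}", lambda m: variables[m.group(1)], value)

# Hypothetical values for the three documented variables.
variables = {
    "swift.home": "/opt/swift",
    "user.name": "alice",
    "user.home": "/home/alice",
}

print(expand("${swift.home}/etc/sites.xml", variables))  # /opt/swift/etc/sites.xml
```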
The following is a list of valid Swift properties:
Swift Properties
- sites.file
Valid values: <file>
Default value: ${swift.home}/etc/sites.xml
Points to the location of the site catalog, which contains a list of all sites that Swift should use.
- tc.file
Valid values: <file>
Default value: ${swift.home}/etc/tc.data
Points to the location of the transformation catalog file, which contains information about installed applications. Details about the format of the transformation catalog can be found in the section on the Transformation Catalog (tc.data).
- ip.address
Valid values: <ipaddress>
Default value: N/A
The Globus GRAM service uses a callback mechanism to send notifications about the status of submitted jobs. The callback mechanism requires that the Swift client be reachable from the hosts the GRAM services are running on. Normally, Swift can detect the correct IP address of the client machine. However, in certain cases (such as the client machine having more than one network interface) the automatic detection mechanism is not reliable. In such cases, the IP address of the Swift client machine can be specified using this property. The value of this property must be a numeric address without quotes.
- lazy.errors
Valid values: true, false
Default value: false
Swift can report application errors in two modes, depending on the value of this property. If set to false, Swift will report the first error encountered and immediately stop execution. If set to true, Swift will attempt to run as much as possible from a workflow before stopping execution and reporting all errors encountered.
When developing workflows, using the default value of false can make the workflow easier to debug. However, in production runs, using true will allow more of a workflow to be run before Swift aborts execution.
- caching.algorithm
Valid values: LRU
Default value: LRU
Swift caches files that are staged in on remote resources, and files that are produced remotely by applications, so that they can be re-used if needed without being transferred again. However, the amount of remote file system space to be used for caching can be limited using the swift:storagesize profile entry in the sites.xml file. Example:
<pool handle="example" sysinfo="INTEL32::LINUX">
    <gridftp url="gsiftp://example.org" storage="/scratch/swift" major="2" minor="4" patch="3"/>
    <jobmanager universe="vanilla" url="example.org/jobmanager-pbs" major="2" minor="4" patch="3"/>
    <workdirectory>/scratch/swift</workdirectory>
    <profile namespace="SWIFT" key="storagesize">20000000</profile>
</pool>
The decision of which files to keep in the cache and which files to remove is made considering the value of the caching.algorithm property. Currently, the only available value for this property is LRU, which causes the least recently used files to be deleted first.
- execution.retries
Valid values: positive integers
Default value: 2
The number of times a job will be retried if it fails (giving a maximum of 1 + execution.retries attempts at execution).
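The relationship between execution.retries and the total number of attempts can be sketched as a loop. This is a Python model of the counting only, not Swift's scheduler; run_with_retries and flaky are hypothetical names.

```python
def run_with_retries(job, retries=2):
    """Run job() at most 1 + retries times; return (result, attempts)."""
    attempts = 0
    last_error = None
    while attempts < 1 + retries:
        attempts += 1
        try:
            return job(), attempts
        except Exception as error:
            last_error = error          # remember the failure and try again
    raise RuntimeError(f"job failed after {attempts} attempts") from last_error

calls = []

def flaky():
    # Hypothetical job that fails twice and then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise IOError("transient failure")
    return "done"

result = run_with_retries(flaky)
print(result)  # ('done', 3)
```

With the default of 2 retries, the job above succeeds on its third and final permitted attempt; a job that failed a third time would be reported as failed.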
- pgraph
Valid values: true, false, <file>
Default value: false
Swift can generate a Graphviz file representing the structure of the workflow it runs. If this property is set to true, Swift will save the provenance graph in a file named by concatenating the workflow name and the instance ID (e.g. helloworld-ht0adgi315l61.dot).
If set to false, no provenance graph will be generated. If a file name is used, then the provenance graph will be saved in the specified file.
The generated dot file can be rendered into a graphical form using Graphviz, for example with a command-line such as:
swift -pgraph graph1.dot q1.swift
dot -ograph.png -Tpng graph1.dot
- pgraph.graph.options
Valid values: <string>
Default value: splines="compound", rankdir="TB"
This property specifies a Graphviz specific set of parameters for the graph.
- pgraph.node.options
Valid values: <string>
Default value: color="seagreen", style="filled"
Used to specify a set of Graphviz specific properties for the nodes in the graph.
- clustering.enabled
Valid values: true, false
Default value: false
Enables clustering.
- clustering.queue.delay
Valid values: <int>
Default value: 4
This property indicates the interval, in seconds, at which the clustering queue is processed.
- clustering.min.time
Valid values: <int>
Default value: 60
Indicates the threshold wall time for clustering, in seconds. Jobs that have a wall time smaller than the value of this property will be considered for clustering.
- kickstart.enabled
Valid values: true, false, maybe
Default value: maybe
This option controls when Swift uses Kickstart. A value of false disables the use of Kickstart, while a value of true enables the use of Kickstart, in which case sites specified in the sites.xml file must have valid gridlaunch attributes. The maybe value will enable the use of Kickstart only on sites that have the gridlaunch attribute specified.
- kickstart.always.transfer
Valid values: true, false
Default value: false
This property controls when output from Kickstart is transferred back to the submit site, if Kickstart is enabled. When set to false, Kickstart output is only transferred for jobs that fail. If set to true, Kickstart output is transferred after every job, whether it completed or failed.
- wrapperlog.always.transfer
Valid values: true, false
Default value: false
This property controls when output from the Swift remote wrapper is transferred back to the submit site. When set to false, wrapper logs are only transferred for jobs that fail. If set to true, wrapper logs are transferred after every job, whether it completed or failed.
- throttle.submit
Valid values: <int>, off
Default value: 4
Limits the number of concurrent submissions for a workflow instance. This throttle only limits the number of concurrent tasks (jobs) that are being sent to sites, not the total number of concurrent jobs that can be run. The submission stage in GRAM is one of the most CPU-expensive stages (due mostly to the mutual authentication and delegation). Having too many concurrent submissions can overload either or both the submit host CPU and the remote host/head node, causing degraded performance.
- throttle.host.submit
Valid values: <int>, off
Default value: 2
Limits the number of concurrent submissions for any of the sites Swift will try to send jobs to. In other words, it guarantees that, for any one site, the number of jobs concurrently in the process of being submitted never exceeds the value of this throttle.
- throttle.transfers
Valid values: <int>, off
Default value: 4
Limits the total number of concurrent file transfers that can happen at any given time. File transfers consume bandwidth. Too many concurrent transfers can cause the network to be overloaded, preventing various other signaling traffic from flowing properly.
- throttle.file.operations
Valid values: <int>, off
Default value: 8
Limits the total number of concurrent file operations that can happen at any given time. File operations (like transfers) require an exclusive connection to a site. These connections can be expensive to establish. A large number of concurrent file operations may cause Swift to attempt to establish many such expensive connections to various sites. Limiting the number of concurrent file operations causes Swift to use a small number of cached connections and achieve better overall performance.
- throttle.score.job.factor
Valid values: <int>, off
Default value: 4
The Swift scheduler has the ability to limit the number of concurrent jobs allowed on a site based on the performance history of that site. Each site is assigned a score (initially 1), which can increase or decrease based on whether the site yields successful or faulty job runs. The score for a site can take values in the (0.1, 100) interval. The number of allowed jobs is calculated using the following formula:
2 + score*throttle.score.job.factor
This means a site will always be allowed at least two concurrent jobs and at most 2 + 100*throttle.score.job.factor. With a default of 4 this means at least 2 jobs and at most 402.
This parameter can also be set per site using the jobThrottle profile key in a site catalog entry.
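The formula can be checked with a few lines of arithmetic. The Python sketch below is not the Swift scheduler; the function name allowed_jobs is hypothetical, and both the clamping of the score to the documented range and the truncation of the result to a whole number of jobs are assumptions made to reproduce the stated bounds of 2 and 402.

```python
def allowed_jobs(score, factor=4):
    """Concurrent jobs allowed on a site: 2 + score * factor."""
    score = min(max(score, 0.1), 100)   # assumption: clamp to the documented range
    return int(2 + score * factor)      # assumption: truncate to whole jobs

print(allowed_jobs(1))     # initial score: 6
print(allowed_jobs(0.1))   # lowest score: 2
print(allowed_jobs(100))   # highest score: 402
```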
- sitedir.keep
Valid values: true, false
Default value: false
Indicates whether the working directory on the remote site should be left intact even when the workflow completes successfully. This can be used to inspect the site working directory for debugging purposes.
Example:
sites.file=${swift.home}/etc/sites.xml
tc.file=${swift.home}/etc/tc.data
ip.address=192.168.0.1
The swift command is the main command line tool for executing SwiftScript programs.
The swift command is invoked as follows:
swift [options] SwiftScript-program [SwiftScript-arguments]
with options taken from the following list, and SwiftScript-arguments made available to the SwiftScript program through the @arg function.
Swift command-line options
- -help or -h
Display usage information
- -typecheck
Does a typecheck instead of executing the workflow
- -dryrun
Runs the workflow without submitting any jobs (can be used to get a graph)
- -monitor
Shows a graphical resource monitor
- -resume file
Resumes the execution using a log file
- -config file
Indicates the Swift configuration file to be used for this run. Properties in this configuration file will override the default properties. If individual command line arguments are used for properties, they will override the contents of this file.
- -verbose | -v
Increases the level of output that Swift produces on the console to include more detail about the execution
- -debug | -d
Increases the level of output that Swift produces on the console to include lots of detail about the execution
- -logfile
file Specifies a file where log messages should go to. By default Swift uses the name of the workflow being run and a numeric index (e.g. myworkflow.1.log)
- -runid identifier
Specifies the run identifier. This must be unique for every invocation of a workflow and is used in several places to keep files from different executions cleanly separated. By default, a datestamp and random number are used to generate a run identifier. When using this parameter, care should be taken to ensure that the run ID remains unique with respect to all other run IDs that might be used, irrespective of (at least) expected run location, workflow or user.
- -tcp.port.range start,end
A TCP port range can be specified to restrict the ports on which GRAM callback services are started. This is likely needed if your submit host is behind a firewall, in which case the firewall should be configured to allow incoming connections on ports in the range.
In addition, the following Swift properties can be set on the command line:
- caching.algorithm
- clustering.enabled
- clustering.min.time
- clustering.queue.delay
- ip.address
- kickstart.always.transfer
- kickstart.enabled
- lazy.errors
- pgraph
- pgraph.graph.options
- pgraph.node.options
- sitedir.keep
- sites.file
- tc.file
Kickstart is a tool that can be used to gather various information about the remote execution environment for each job that Swift tries to run.
For each job, Kickstart generates an XML invocation record. By default this record is staged back to the submit host if the job fails.
Before it can be used it must be installed on the remote site and the sites file must be configured to point to kickstart.
Kickstart can be downloaded as part of the Pegasus 'worker package' available from the worker packages section of the Pegasus download page.
Untar the relevant worker package somewhere where it is visible to all of the worker nodes on the remote execution machine (such as in a shared application filesystem).
Now point the site catalog at that path by adding a gridlaunch attribute to the pool element in the site catalog:
<pool handle="example" gridlaunch="/usr/local/bin/kickstart" sysinfo="INTEL32::LINUX"> [...] </pool>
There are various kickstart.* properties, which have sensible default values. These are documented in the properties section.
If a workflow fails, Swift can resume that workflow from the point of failure. If a Swift workflow fails, a restart log file will be generated using the unique job ID, with a .rlog extension. This restart log can then be passed to a subsequent Swift invocation using the -resume parameter. Swift will resume executing the workflow. Previously executed tasks will not be run a second time. The SwiftScript source file should not be modified between invocations.
Every execution of a workflow creates a restart log file with a name composed of the file name of the workflow being executed, an invocation ID, a numeric ID, and the .rlog extension. For example, example.swift, when executed, could produce the following restart log file: example-ht0adgi315l61.0.rlog. Normally, if the workflow completes successfully, the restart log file is deleted. If, however, the workflow fails, swift can use the restart log file to continue the execution of the workflow from a point before the failure occurred. To restart a workflow from a restart log file, pass the -resume argument and the restart log file name to swift along with the workflow file name. Example:
$ swift -resume example-ht0adgi315l61.0.rlog example.swift
There are certain requirements on the behaviour of application programs used in SwiftScript programs. These requirements are primarily to ensure that Swift can run your application in different places.
Swift must know about all of your data files - when Swift has decided where to run your application, it will transfer the necessary input files there before execution and transfer the output files back to the submitting system afterwards. If Swift does not know about your files, then it cannot do this. The way to tell Swift about files is by mapping them to variables and using those variables as parameters to your application.
Applications should take the name of input and output files on the command line - Sometimes Swift will decide on the name of your input and output files automatically (for example, if you do not specify a mapping explicitly for an input or output variable). Swift must be able to tell your application which filename it has chosen, and the commandline is the way it does that. Use the @filename function to determine the filename of a variable.
Applications should not assume that they are running in a particular location or on a particular host - Swift will decide which site to run a job on automatically (based on the sites that it knows have the application installed, by looking at the transformation catalog). On that site, it will create a unique working directory every time that it runs your jobs. Your job should expect to be run in an arbitrary working directory on any of the available hosts.
Running your application on the same input files multiple times should always give equivalent output files. Swift expects to be able to run a job multiple times, perhaps on the same site, perhaps on different sites, in order to deal with error conditions. For example, applications should not make modifications to external databases that cause their output to differ if they are run more than once.
This section attempts to provide a technical overview of the Swift architecture.
The execution layer causes an application program (in the form of a unix executable) to be executed either locally or remotely.
The two main choices are local unix execution and execution through GRAM. Other options are available, and user provided code can also be plugged in.
The kickstart utility can be used to capture environmental information at execution time to aid in debugging and provenance capture.
Step i: text to XML intermediate form parser/processor. The parser is written in ANTLR - see resources/VDL.g. The XML Schema Definition (XSD) for the intermediate language is in resources/XDTM.xsd.
Step ii: XML intermediate form to Karajan workflow. Karajan.java reads the XML intermediate form and compiles it to the Karajan workflow language - for example, expressions are converted from SwiftScript syntax into Karajan syntax, and function invocations become Karajan function invocations with various modifications to parameters to accommodate return parameters and dataset handling.
Swift is extensible in a number of ways. It is possible to add mappers to accommodate different filesystem arrangements, site selectors to change how Swift decides where to run each job, and job submission interfaces to submit jobs through different mechanisms.
A number of mappers are provided as part of the Swift release and documented in the mappers section. New mappers can be implemented in Java by implementing the org.griphyn.vdl.mapping.Mapper interface. The Swift tutorial contains a simple example of this.
Swift provides a default site selector, the Adaptive Scheduler. New site selectors can be plugged in by implementing the org.globus.cog.karajan.scheduler.Scheduler interface and modifying libexec/scheduler.xml and etc/karajan.properties to refer to the new scheduler.
Execution providers, which allow Swift to execute jobs through specific mechanisms (such as local fork or GRAM), can be implemented as Java CoG kit providers.
This section details functions that are available for use in the SwiftScript language.
Takes a command line parameter name as a string parameter and an optional default value and returns the value of that string parameter from the command line. If no default value is specified and the command line parameter is missing, an error is generated. If a default value is specified and the command line parameter is missing, @arg will return the default value.
Command line parameters recognized by @arg begin with exactly one hyphen and need to be positioned after the script name.
For example:
trace(@arg("myparam"));
trace(@arg("optionalparam", "defaultvalue"));
$ swift arg.swift -myparam=hello
Swift v0.3-dev r1674 (modified locally)
RunID: 20080220-1548-ylc4pmda
SwiftScript trace: defaultvalue
SwiftScript trace: hello
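The lookup rules described for @arg can be modelled in Python as follows. This is an illustrative sketch, not Swift's implementation: the function name arg and the handling of the argument list are assumptions, and the -name=value form is taken from the example above.

```python
def arg(argv, name, default=None):
    """Look up -name=value among the arguments that follow the script name."""
    # Assumption: argv[0] is the script name and parameters use the
    # -name=value form shown in the example above.
    for item in argv[1:]:
        # Exactly one leading hyphen: skip "--name=value" and bare words.
        if item.startswith("-") and not item.startswith("--") and "=" in item:
            key, value = item[1:].split("=", 1)
            if key == name:
                return value
    if default is None:
        # No default was given, so a missing parameter is an error.
        raise KeyError(f"missing command line parameter: {name}")
    return default

argv = ["arg.swift", "-myparam=hello"]
print(arg(argv, "myparam"))                        # hello
print(arg(argv, "optionalparam", "defaultvalue"))  # defaultvalue
```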
@extractint(file) will read the specified file, parse an integer from the file contents and return that integer.
@filename(v) will return a string containing the filename(s) for the file(s) mapped to the variable v. When more than one filename is returned, the filenames will be space separated inside a single string return value.
@filenames(v) will return multiple values (!) containing the filename(s) for the file(s) mapped to the variable v. (compare to @filename)
@regexp(input,pattern,replacement) will apply regular expression substitution using the Java java.util.regex API. For example:
string v = @regexp("abcdefghi", "c(def)g","monkey");
will assign the value "abmonkeyhi" to the variable v.
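For readers more familiar with Python, the same substitution can be reproduced with the standard re module. Python's regular expression syntax is used here as a stand-in for the Java API, so the two may differ on more advanced patterns.

```python
import re

# Replace the text matched by the pattern with the replacement string.
v = re.sub(r"c(def)g", "monkey", "abcdefghi")
print(v)  # abmonkeyhi
```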
@strcat(a,b,c,d,...) will return a string containing all of the strings passed as parameters joined into a single string. There may be any number of parameters.
The + operator concatenates two strings: @strcat(a,b) is the same as a + b
@strcut(input,pattern) will match the regular expression in the pattern parameter against the supplied input string and return the section that matches the first matching parenthesised group.
For example:
string t = "my name is John and i like puppies.";
string name = @strcut(t, "my name is ([^ ]*) ");
string out = @strcat("Your name is ",name);
print(out);
will output the message 'Your name is John'.
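The behaviour of @strcut can be approximated in Python as shown below. The function name strcut is hypothetical, and raising an error on a failed match is an assumption, since the text above does not say what @strcut does in that case.

```python
import re

def strcut(text, pattern):
    """Return the first parenthesised group of the first match of pattern."""
    match = re.search(pattern, text)
    if match is None:
        # Assumption: behaviour on a failed match is not documented above.
        raise ValueError("pattern did not match")
    return match.group(1)

t = "my name is John and i like puppies."
name = strcut(t, "my name is ([^ ]*) ")
print("Your name is " + name)  # Your name is John
```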
This section details built-in procedures that are available for use in the SwiftScript language.
readData will read data from a specified file.
The format of the input file is controlled by the type of the return value.
For scalar return types, such as int, the specified file should contain a single value of that type.
For arrays of scalars, the specified file should contain one value per line.
For structs of scalars, the file should contain two rows. The first row should be structure member names separated by whitespace. The second row should be the corresponding values for each structure member, separated by whitespace, in the same order as the header row.
For arrays of structs, the file should contain a heading row listing structure member names separated by whitespace. There should be one row for each element of the array, with structure member elements listed in the same order as the header row and separated by whitespace.
This procedure is new in 0.4.
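To make the file layout for arrays of structs concrete, the following Python sketch parses a header row followed by one row per element. It only illustrates the format: the function name and sample data are hypothetical, and values are kept as strings because the conversion to SwiftScript types depends on the declared return type.

```python
def read_struct_array(text):
    """Parse a header row plus one whitespace-separated row per element."""
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split()          # structure member names
    rows = []
    for line in lines[1:]:
        values = line.split()
        if len(values) != len(names):
            raise ValueError(f"expected {len(names)} values, got {len(values)}")
        # Values appear in the same order as the header row.
        rows.append(dict(zip(names, values)))
    return rows

sample = """name  count
alpha 3
beta  7
"""
for element in read_struct_array(sample):
    print(element)
```

A file describing a single struct of scalars has the same layout with exactly one data row.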
Deprecated - use trace instead.
print will print its parameters to stdout; but will do this at a point in execution that is undefined. Specifically, it will not necessarily wait for its parameters to be properly set.
maxSubmitRate - limits the maximum rate of job submission, in jobs per second. For example:
<profile namespace="karajan" key="maxSubmitRate">0.2</profile>
will limit job submission to 0.2 jobs per second (or equivalently, one job every five seconds).
jobThrottle - allows the job throttle factor (see Swift property throttle.score.job.factor) to be set per site.
initialScore - allows the initial score for rate limiting and site selection to be set to a value other than 0.
delayBase - controls how much a site will be delayed when it performs poorly. With each reduction in a site's score by 1, the delay between execution attempts will increase by a factor of delayBase.
storagesize limits the amount of space that will be used on the remote site for temporary files. When more than that amount of space is used, the remote temporary file cache will be cleared using the algorithm specified in the caching.algorithm property.
maxwalltime specifies a walltime limit for each job, in minutes. This profile setting also interacts with the clustering mechanism.
The following formats are recognized:
- Minutes
- Hours:Minutes
- Hours:Minutes:Seconds
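The three maxwalltime formats can be normalised to a single unit with a small helper. The sketch below is one possible interpretation in Python, not the parser Swift itself uses; the function name is hypothetical.

```python
def walltime_seconds(value):
    """Convert Minutes, Hours:Minutes or Hours:Minutes:Seconds to seconds."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        hours, minutes, seconds = 0, parts[0], 0
    elif len(parts) == 2:
        hours, minutes, seconds = parts[0], parts[1], 0
    elif len(parts) == 3:
        hours, minutes, seconds = parts
    else:
        raise ValueError(f"unrecognised walltime: {value!r}")
    return hours * 3600 + minutes * 60 + seconds

for text in ("90", "1:30", "0:01:30"):
    print(text, "->", walltime_seconds(text), "seconds")
```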
queue is used by the PBS, GRAM2 and GRAM4 providers. This profile entry specifies which queue jobs will be submitted to. The valid queue names are site-specific.
host_types specifies the types of host that are permissible for a job to run on. The valid values are site-specific. This profile entry is used by the GRAM2 and GRAM4 providers.
condor_requirements allows a requirements string to be specified when Condor is used as an LRM behind GRAM2. Example: <profile namespace="globus" key="condor_requirements">Arch == "X86_64" || Arch == "INTEL"</profile>
coastersPerNode specifies the number of coaster workers to be run on each node. This profile entry is used by the coaster provider.
Swift can group a number of short job submissions into a single larger job submission to minimize overhead involved in launching jobs (for example, caused by security negotiation and queuing delay).
By default, clustering is disabled. It can be activated by setting the clustering.enabled property to true.
A job is eligible for clustering if the GLOBUS::maxwalltime profile is specified in the tc.data entry for that job, and its value is less than the value of the clustering.min.time property.
Two or more jobs are considered compatible if they share the same site and do not have conflicting profiles (e.g. different values for the same environment variable).
When a submitted job is eligible for clustering, it will be put in a clustering queue rather than being submitted to a remote site. The clustering queue is processed at intervals specified by the clustering.queue.delay property. The processing of the clustering queue consists of selecting compatible jobs and grouping them into clusters whose maximum wall time does not exceed twice the value of the clustering.min.time property.
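The eligibility and grouping rules above can be sketched in Python as follows. This is a simplified model rather than Swift's actual algorithm: it treats jobs on the same site as compatible (ignoring profile conflicts), assumes a cluster's wall time is the sum of its members' wall times, and packs jobs greedily in arrival order. All names are hypothetical, and walltimes are in the same unit as min_time.

```python
from collections import defaultdict

def build_clusters(jobs, min_time):
    """Group eligible jobs per site so each cluster totals <= 2 * min_time.

    jobs is a list of (name, site, walltime) tuples.
    Returns (clusters, direct), where direct holds jobs submitted unclustered.
    """
    limit = 2 * min_time
    queued = defaultdict(list)   # the clustering queue, keyed by site
    direct = []
    for job in jobs:
        if job[2] < min_time:
            queued[job[1]].append(job)   # eligible for clustering
        else:
            direct.append(job)           # submitted on its own
    clusters = []
    for site, site_jobs in queued.items():
        current, total = [], 0
        for job in site_jobs:
            # Start a new cluster when adding this job would exceed the limit.
            if current and total + job[2] > limit:
                clusters.append((site, current))
                current, total = [], 0
            current.append(job)
            total += job[2]
        if current:
            clusters.append((site, current))
    return clusters, direct

jobs = [("a", "tguc", 50), ("b", "tguc", 50), ("c", "tguc", 50),
        ("d", "local", 10), ("e", "tguc", 120)]
clusters, direct = build_clusters(jobs, min_time=60)
for site, members in clusters:
    print(site, [name for name, _, _ in members])
print("unclustered:", [name for name, _, _ in direct])
```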
If you have a UChicago Computation Institute account, run this command in your submit directory after each run. It will copy all your logs and kickstart records into a directory at the CI for reporting, usage tracking, support and debugging.
rsync --ignore-existing *.log *.d login.ci.uchicago.edu:/home/benc/swift-logs/ --verbose
TeraGrid users with no default project or with several project allocations can specify a project allocation using a profile key in the site catalog entry for a TeraGrid site:
<profile namespace="globus" key="project">TG-CCR080002N</profile>
More information on the TeraGrid allocations process can be found here.
Here is an example of running a simple MPI program.
In SwiftScript, we make an invocation that does not look any different from any other invocation. In the code below, we do not have any input files, and have two output files, one for stdout and one for stderr:
type file;
(file o, file e) p() {
    app {
        mpi stdout=@filename(o) stderr=@filename(e);
    }
}
file mpiout <"mpi.out">;
file mpierr <"mpi.err">;
(mpiout, mpierr) = p();
Now we define how 'mpi' will run in tc.data:
tguc mpi /home/benc/mpi/mpi.sh INSTALLED INTEL32::LINUX GLOBUS::host_xcount=3
mpi.sh is a wrapper script that launches the MPI program. It must be installed on the remote site:
#!/bin/bash
mpirun -np 3 -machinefile $PBS_NODEFILE /home/benc/mpi/a.out
Because of the way that Swift runs its server side code, provider-specific MPI modes (such as GRAM jobType=mpi) should not be used. Instead, the mpirun command should be explicitly invoked.
The site catalog lists details of each site that Swift can use. The default file contains one entry for local execution, and a large number of commented-out example entries for other sites.
By default, the site catalog is stored in etc/sites.xml. This path can be overridden with the sites.file configuration property, either in the Swift configuration file or on the command line.
The sites file is formatted as XML. It consists of <pool> elements, one for each site that Swift will use.
Each pool element must have a handle attribute, giving a symbolic name for the site. This can be any name, but must correspond to entries for that site in the transformation catalog.
Optionally, the gridlaunch attribute can be used to specify the path to kickstart on the site.
Each pool must specify a file transfer method, an execution method and a remote working directory. Optionally, profile settings can be specified.
Transfer methods are specified with the <gridftp> element or with the <filesystem> element.
To use gridftp or local filesystem copy, use the <gridftp> element:
<gridftp url="gsiftp://evitable.ci.uchicago.edu" />
The URL attribute may specify a GridFTP server, using the gsiftp URI scheme; or it may specify that filesystem copying will be used (which assumes that the site has access to the same filesystem as the submitting machine) using the URI local://localhost.
Filesystem access using scp (the SSH copy protocol) can be specified using the <filesystem> element:
<filesystem url="www11.i2u2.org" provider="ssh"/>
For additional ssh configuration information, see the ssh execution provider documentation below.
Execution methods may be specified either with a <jobmanager> or <execution> element.
The <jobmanager> element can be used to specify execution through GRAM2. For example,
<jobmanager universe="vanilla" url="evitable.ci.uchicago.edu/jobmanager-fork" major="2" />
The universe attribute should always be set to vanilla. The url attribute should specify the name of the GRAM2 gatekeeper host, and the name of the jobmanager to use. The major attribute should always be set to 2.
The <execution> element can be used to specify execution through other execution providers:
To use GRAM4, specify the gt4 provider. For example:
<execution provider="gt4" jobmanager="PBS" url="tg-grid.uc.teragrid.org" />
The url attribute should specify the GRAM4 submission site. The jobmanager attribute should specify which GRAM4 jobmanager will be used.
For local execution, the local provider should be used, like this:
<execution provider="local" url="none" />
For PBS execution, the pbs provider should be used:
<execution provider="pbs" url="none" />
The GLOBUS::queue profile key can be used to specify which PBS queue jobs will be submitted to.
For execution through SSH, the ssh provider should be used:
<execution url="www11.i2u2.org" provider="ssh"/>
with configuration made in ~/.ssh/auth.defaults, with the string 'www11.i2u2.org' changed to the appropriate host name:
www11.i2u2.org.type=key
www11.i2u2.org.username=hategan
www11.i2u2.org.key=/home/mike/.ssh/i2u2portal
www11.i2u2.org.passphrase=XXXX
For execution using the CoG Coaster mechanism, the coaster provider should be used:
<execution provider="coaster" url="tg-grid.uc.teragrid.org"
jobmanager="gt2:gt2:pbs" />
with the jobmanager parameter specifying: the cog provider to use to submit the coaster head job; the cog provider to use to submit coaster worker jobs; and optionally the jobmanager to be used by worker submission.
The workdirectory element specifies where on the site files can be stored.
<workdirectory>/home/benc</workdirectory>
This directory must be accessible through the transfer mechanism specified in the <gridftp> element and also mounted on all worker nodes that will be used for execution. A shared cluster scratch filesystem is appropriate for this.
Profile keys can be specified using the <profile> element. For example:
<profile namespace="globus" key="queue">fast</profile>
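The requirements on a pool entry (a handle, a transfer method, an execution method and a working directory) can be checked mechanically. The Python sketch below assembles one pool from the fragments shown in this section and validates it. The enclosing <config> root element, the absence of an XML namespace and the function name check_pool are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# A pool assembled from the fragments above; <config> is an assumed root.
SITES = """
<config>
  <pool handle="example">
    <gridftp url="gsiftp://evitable.ci.uchicago.edu" />
    <jobmanager universe="vanilla" url="evitable.ci.uchicago.edu/jobmanager-fork" major="2" />
    <workdirectory>/home/benc</workdirectory>
    <profile namespace="globus" key="queue">fast</profile>
  </pool>
</config>
"""

def check_pool(pool):
    """Return a list of problems found in one <pool> element."""
    problems = []
    if not pool.get("handle"):
        problems.append("missing handle attribute")
    if pool.find("gridftp") is None and pool.find("filesystem") is None:
        problems.append("no transfer method")
    if pool.find("jobmanager") is None and pool.find("execution") is None:
        problems.append("no execution method")
    if pool.find("workdirectory") is None:
        problems.append("no workdirectory")
    return problems

root = ET.fromstring(SITES)
for pool in root.findall("pool"):
    print(pool.get("handle"), check_pool(pool) or "ok")
```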
The site catalog is an evolution of the VDS site catalog which is documented here.
The transformation catalog lists where application executables are located on remote sites.
By default, the transformation catalog is stored in etc/tc.data. This path can be overridden with the tc.file configuration property, either in the Swift configuration file or on the command line.
The format is one line per executable per site, with fields separated by tabs. Spaces cannot be used as a field separator.
Some example entries:
localhost echo /bin/echo INSTALLED INTEL32::LINUX null
TGUC touch /usr/bin/touch INSTALLED INTEL32::LINUX GLOBUS::maxwalltime="0:1"
The fields are: site, transformation name, executable path, installation status, platform, and profile entries.
The site field should correspond to a site name listed in the sites catalog.
The transformation name should correspond to the transformation name used in a SwiftScript app {} block.
The executable path should specify where the particular executable is located on that site.
The installation status and platform fields are not used. Set them to INSTALLED and INTEL32::LINUX respectively.
The profiles field should be set to 'null' if no profile entries are to be specified, or should contain the profile entries separated by semicolons.
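The field layout described above can be illustrated with a short Python parser for a single tc.data line. It is a sketch only: the function name is hypothetical, and it assumes each profile entry has the form NAMESPACE::key="value", as in the example entries.

```python
def parse_tc_line(line):
    """Split one tab-separated tc.data entry into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    site, name, path, status, platform, profiles = fields
    entries = {}
    if profiles != "null":
        for entry in profiles.split(";"):
            # Assumption: entries look like NAMESPACE::key="value".
            key, _, value = entry.partition("=")
            entries[key.strip()] = value.strip().strip('"')
    return {"site": site, "transformation": name, "path": path,
            "status": status, "platform": platform, "profiles": entries}

line = 'TGUC\ttouch\t/usr/bin/touch\tINSTALLED\tINTEL32::LINUX\tGLOBUS::maxwalltime="0:1"'
print(parse_tc_line(line))
```

A line that uses spaces instead of tabs yields a single field and is rejected, which mirrors the rule that spaces cannot be used as a field separator.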
There are a number of environment variables used by Swift. Some of them are documented in this section:
PATHPREFIX - set in env namespace profiles. This path is prefixed onto the start of the PATH when jobs are executed. It can be more useful than setting the PATH environment variable directly, because setting PATH will cause the execution site's default path to be lost.
GLOBUS_HOSTNAME, GLOBUS_TCP_PORT_RANGE - set in the environment before running Swift. These can be set to inform Swift of the configuration of your local firewall. More information can be found in the Globus firewall How-to.
COG_OPTS - set in the environment before running Swift. Options set in this variable will be passed as parameters to the Java Virtual Machine which will run Swift. The parameters vary between virtual machine implementations, but can usually be used to alter settings such as maximum heap size. Typing 'java -help' will sometimes give a list of options. The Sun Java 1.4.2 command line options are documented here.
SWIFT_JOBDIR_PATH - set in env namespace profiles. If set, then Swift will use the path specified here as a worker-node local temporary directory to copy input files to before running a job. If unset, Swift will keep input files on the site-shared filesystem. In some cases, copying to a worker-node local directory can be much faster than having applications access the site-shared filesystem directly.
See the Swift download page for instructions on downloading and building Swift from source. When building, various build options can be supplied on the ant commandline. These are summarised here:
with-provider-condor - build with CoG condor provider
with-provider-coaster - build with CoG coaster provider
with-provider-deef - build with Falkon provider deef. In order for this option to work, it is necessary to check out the provider-deef code in the cog/modules directory alongside vdsk:
$ cd cog/modules
$ svn co https://svn.ci.uchicago.edu/svn/vdl2/provider-deef
$ cd ../vdsk
$ ant -Dwith-provider-deef=true redist
with-provider-wonky - build with provider-wonky, an execution provider that provides delays and unreliability for the purposes of testing Swift's fault tolerance mechanisms. In order for this option to work, it is necessary to check out the provider-wonky code in the cog/modules directory alongside vdsk:
$ cd cog/modules
$ svn co https://svn.ci.uchicago.edu/svn/vdl2/provider-wonky
$ cd ../vdsk
$ ant -Dwith-provider-wonky=true redist
no-supporting - produces a distribution without supporting commands such as grid-proxy-init. This is intended for environments where those commands are already provided by other packages (such as a Globus Toolkit package) and the Swift package should provide only Swift commands. In such environments, copies of commands such as grid-proxy-init from the Swift distribution would otherwise mask the versions from their true distribution package if they appeared in the path.
$ ant -Dno-supporting=true redist