Advanced data access with gridftp

From Begrid Wiki
Revision as of 09:11, 9 June 2021 by Maintenance script

Data Access

Instead of attaching files to a job with the InputSandbox and OutputSandbox, you can also use the GridFTP system. This is especially useful for larger files (> 20 MB). In principle it can handle files of several terabytes, but in practice it is limited by the storage capacity of the sites.

It is not meant for small files: transferring even an empty file takes about 2 seconds. You can, however, tar many small files together and upload or download them as one large file.
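A minimal sketch of the bundling step, assuming a directory of small result files. The directory and archive names (`results`, `results.tar.gz`) are hypothetical and only serve as an illustration; the archive is the single file you would then upload or download.

```shell
# Create a hypothetical directory with a few small files to bundle.
workdir=$(mktemp -d)
mkdir "$workdir/results"
for i in 1 2 3; do
    echo "result $i" > "$workdir/results/part$i.txt"
done

# Pack the whole directory into one compressed archive; this single file
# is what would be transferred instead of the individual small files.
tar -czf "$workdir/results.tar.gz" -C "$workdir" results

# On the receiving side: list the archive to check it, then unpack it.
tar -tzf "$workdir/results.tar.gz"
mkdir "$workdir/unpacked"
tar -xzf "$workdir/results.tar.gz" -C "$workdir/unpacked"
```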

Command overview

To use GridFTP, there are a few commands available on the User Interface and on the Worker Nodes (Peris, et al., 2005):

GridFTP commands for replica management:
lcg-cp copies a file from the grid to a local folder (download)
lcg-cr copies a local file to an SE and registers it in the file catalog (LFC or LRC) (upload)
lcg-del deletes a file (either one replica, or all replicas)
lcg-rep copies a file from one SE to another and registers this in the file catalog (LFC or LRC) (replication)
lcg-gt returns the TURL from a given SURL and a given file transfer protocol
lcg-sd sets the file state of a given SURL in an SRM request to "Done"
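
The following sketch shows how some of these commands might be called. It only prints the commands unless `DRY_RUN` is set to an empty value, so it can be read and tried without touching any data. The `run` helper and the host `se02.example.org` are hypothetical; the VO, LFN and SURL are the examples used elsewhere on this page. Check the options against the man pages on your User Interface before relying on them.

```shell
VO=beapps
LFN=lfn:/grid/beapps/input.dat
SURL=sfn://kg-se01.cc.kuleuven.be/storage/beapps/generated/2006-03-20/filee43431cb-0aba-4c2e-b99f-9d230181b18b

# Print the commands by default; run with DRY_RUN= (empty) to execute them.
DRY_RUN=${DRY_RUN-1}
run() {
    if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

# Replicate the file to a second SE (hypothetical host name).
run lcg-rep --vo "$VO" -d se02.example.org "$LFN"

# Ask for a TURL for the given SURL and transfer protocol.
run lcg-gt "$SURL" gsiftp

# Delete only the replica on the second SE ...
run lcg-del --vo "$VO" -s se02.example.org "$LFN"

# ... or delete all replicas of the file.
run lcg-del --vo "$VO" -a "$LFN"
```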


GridFTP commands for file catalog interaction:
lcg-aa adds an alias to the file catalog (LFC or RMC) for a given GUID
lcg-ra deletes the alias for a given GUID from the file catalog (LFC or RMC)
lcg-rf registers a file on an SE in the file catalog (LFC or LRC/RMC)
lcg-uf unregisters a file on an SE from the file catalog (LFC or LRC)
lcg-la returns the aliases of a given LFN, GUID or SURL
lcg-lg returns the GUID for a given LFN or SURL
lcg-lr returns the replicas for a given LFN, GUID or SURL
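
As an illustration, the three query commands can be combined into one small helper that prints everything the catalog knows about a file. The function name `describe_entry` is hypothetical, and the VO defaults to beapps as on the rest of this page; this is a sketch, not a tool that ships with the middleware.

```shell
# describe_entry NAME: print the GUID, replicas and aliases of NAME,
# where NAME is an LFN (lfn:...) or a SURL.
describe_entry() {
    vo=${LCG_GFAL_VO:-beapps}   # VO name; beapps is the VO used on this page
    echo "GUID:"
    lcg-lg --vo "$vo" "$1" || return 1
    echo "Replicas:"
    lcg-lr --vo "$vo" "$1" || return 1
    echo "Aliases:"
    lcg-la --vo "$vo" "$1" || return 1
}

# Example call on a User Interface with a valid proxy:
#   describe_entry lfn:/grid/beapps/input.dat
```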


General terms

GridFTP introduces a number of new terms:

GUID (Grid Unique Identifier)

This is a unique and unchangeable label that is given to a file when it is registered in the file catalog. Every copy of this file will be known by this label. Example:

 guid:ad9c9e8c-c357-4bb4-8e8f-d037dd84f914

LFN (Logical File Name)

A unique label that the user is free to choose and that describes the content of the file. The user can change it later. Example:

 lfn:/grid/beapps/invoerbestand

SURL (Storage URL)

This is a URL that identifies a file on a specific SE. Example:

 sfn://kg-se01.cc.kuleuven.be/storage/beapps/generated/2006-03-20/filee43431cb-0aba-4c2e-b99f-9d230181b18b

TURL (Transport URL)

This is a (temporary) URL that grants a user access to a certain file on a certain SE, using a certain transfer protocol (e.g. RFIO, GSIFTP, POSIX (file://)).

Example:

 rfio://kg-se01.cc.kuleuven.be/storage/beapps/generated/2006-03-20/filee43431cb-0aba-4c2e-b99f-9d230181b18b
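
The four kinds of names can be told apart by their prefix. The sketch below does this with a hypothetical helper `classify_name`. The `guid:`, `lfn:`, `sfn://`, `rfio://`, `gsiftp://` and `file://` prefixes come from the examples on this page; treating `srm://` as a SURL prefix is an assumption that is not taken from this page.

```shell
# classify_name NAME: print which kind of grid file name NAME is,
# judging only by its prefix.
classify_name() {
    case "$1" in
        guid:*)                        echo GUID ;;
        lfn:*)                         echo LFN ;;
        sfn://*|srm://*)               echo SURL ;;   # srm:// is an assumption
        rfio://*|gsiftp://*|file://*)  echo TURL ;;
        *)                             echo unknown; return 1 ;;
    esac
}

classify_name guid:ad9c9e8c-c357-4bb4-8e8f-d037dd84f914   # prints GUID
classify_name lfn:/grid/beapps/invoerbestand              # prints LFN
```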

These URLs and IDs are linked as follows:

[Figure: Link between URLs and IDs]


Usage

Let's demonstrate the use of GridFTP with a general experiment. We have an input file input.dat in our home directory, a program process and an output file output.dat that we want back for further processing; we want to execute the following command:

 process input.dat > output.dat

First we put the input file in the file catalog: it is copied to the closest storage element (e.g. kg-se01.cc.kuleuven.be) and is given a descriptive logical file name (LFN):

 $ lcg-cr --vo beapps -d kg-se01.cc.kuleuven.be -l lfn:/grid/beapps/input.dat file://$(pwd)/input.dat
 guid:d91328b1-3a51-442c-ba64-c2bbe032c675

If this command succeeds, it prints the unique ID (GUID) of the file. From now on the file is accessible from anywhere on the grid through this GUID and through the LFN. To make sure the file is registered correctly, we list the catalog directory /grid/beapps with the lfc-ls command:

 $ lfc-ls /grid/beapps
 input.dat
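
In a script it is often handy to keep the GUID that lcg-cr prints and to notice when the upload fails. A minimal sketch, assuming the same VO and SE as above; the function name `register_file` is hypothetical.

```shell
# register_file LOCAL_FILE LFN: upload LOCAL_FILE to the SE, register it
# under LFN and print the GUID that lcg-cr returns.
register_file() {
    vo=${LCG_GFAL_VO:-beapps}
    se=${VO_BEAPPS_DEFAULT_SE:-kg-se01.cc.kuleuven.be}
    # Build an absolute path for the file:// URL, as in the examples above.
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    abs="$dir/$(basename "$1")"
    guid=$(lcg-cr --vo "$vo" -d "$se" -l "$2" "file://$abs") || {
        echo "upload of $1 failed" >&2
        return 1
    }
    echo "$guid"
}

# Example call on a User Interface with a valid proxy:
#   guid=$(register_file input.dat lfn:/grid/beapps/input.dat) && echo "registered as $guid"
```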

We create a simple shell script process.sh that will do the following:

  • retrieve the input file;
  • execute the program process input.dat > output.dat;
  • make the output file available through GridFTP.

The shell script:

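 # Set the VO, the LFC host and the default SE explicitly
 export LCG_GFAL_VO=beapps
 export LFC_HOST=gridy8.begrid.be
 export VO_BEAPPS_DEFAULT_SE=kg-se01.cc.kuleuven.be
 
 # Download the input file from the grid into the working directory
 lcg-cp -v lfn:/grid/beapps/input.dat file://$(pwd)/input.dat
 # Run the program
 process input.dat > output.dat
 # Upload the result and register it under a logical file name
 lcg-cr --vo beapps -d kg-se01.cc.kuleuven.be -l lfn:/grid/beapps/output.dat file://$(pwd)/output.dat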

The export lines at the top set the VO, the LFC host and the default SE explicitly, so that the script works on different grid sites.
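As written, the script keeps going when a step fails, so process could run without its input. The sketch below is one possible variant that stops at the first failing step. The function name `run_job` and the return codes 1 to 3 are arbitrary choices for this illustration, not part of the original script.

```shell
# run_job: the same three steps as process.sh, stopping at the first failure.
run_job() {
    export LCG_GFAL_VO=beapps
    export LFC_HOST=gridy8.begrid.be
    export VO_BEAPPS_DEFAULT_SE=kg-se01.cc.kuleuven.be

    # Step 1: download the input file.
    lcg-cp -v lfn:/grid/beapps/input.dat "file://$(pwd)/input.dat" \
        || { echo "download failed" >&2; return 1; }
    # Step 2: run the program.
    process input.dat > output.dat \
        || { echo "process failed" >&2; return 2; }
    # Step 3: upload and register the output file.
    lcg-cr --vo beapps -d "$VO_BEAPPS_DEFAULT_SE" \
        -l lfn:/grid/beapps/output.dat "file://$(pwd)/output.dat" \
        || { echo "upload failed" >&2; return 3; }
}

# In the job script the function would be called on the last line:
#   run_job
```

Because the function returns a different code per step, the exit status of the job tells you which step went wrong.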

We start the job and wait until it is done. Then we can check the contents of the /grid/beapps directory in the file catalog:

 $ lfc-ls /grid/beapps
 input.dat
 output.dat

As expected, output.dat is now listed as well. The file can be transferred back to the local file system for further processing:

 $ lcg-cp -v --vo beapps lfn:/grid/beapps/output.dat file://$(pwd)/output.dat
 Using grid catalog type: lfc
 Using grid catalog : kg-mon.cc.kuleuven.be
 Source URL: lfn:/grid/beapps/output.dat
 File size: 58
 VO name: beapps
 Source URL for copy: gsiftp://kg-se01.cc.kuleuven.be///dpm/cc.kuleuven.be/home/beapps/generated/2006-03-23/filedef91cab-a200-4721-a15b-dbefc98e4d5c
 Destination URL: file:///gegevens/home/jo/output.dat
 # streams: 1
 # set timeout to 0 (seconds)
                   0 bytes             0.00 KB/sec avg                 0.00 KB/sec inst
 Transfer took 2030 ms

