Preparing Data for OURRstore

Before data can be stored on OURRstore, a user needs to verify and prepare all of their files. This is due to a combination of NSF grant requirements and the physical constraints of tape drives. This document details the data requirements for items stored on OURRstore, as well as some tools to help prepare data for storage.

Data Requirements for OURRstore

  • Files MUST be between 1 GB and 1 TB in size.
    • OURRstore recommends that file sizes be between 20 GB and 200 GB.
    • See the example command after this list for one way to check for files outside the allowed range.
  • Data MUST be related to STEM research.
    • STEM includes the following disciplines (as defined by NSF):
      • Physical sciences
      • Biosciences
      • Geosciences
      • Engineering
      • Mathematics
      • Technology (for example, computer and information sciences)
      • Social sciences
    • Non-STEM data and clinical research data MUST NOT be archived on OURRstore.
    • Human subject data MAY be allowed, provided it meets all other requirements and its storage is permitted by any and all Institutional Review Board (IRB) agreements that apply to it.
  • Data MUST NOT be subject to legal regulation. These regulations include, but are not limited to:
    • HIPAA
    • Controlled Unclassified Information
    • ITAR/EAR
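
One way to check whether a directory contains files outside the allowed size range is to search for them with find. The command below is only a sketch: it assumes GNU find and a hypothetical research-data/ directory, and the byte counts correspond to 1 GB and 1 TB.

find research-data/ -type f \( -size -1000000000c -o -size +1000000000000c \) -exec ls -lh {} +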

Useful Tools for Preparing Data for OURRstore

tar

tar (tape archive) is a standard Linux command-line tool used to combine multiple files into one archive file. This is useful if the data you plan to archive consists of a large number of small files. tar can also compress the resulting archive file to reduce its total size and make it easier to transfer to OURRstore.

To generate an uncompressed archive file from a set of files, run the following command.

tar -cvf <name-for-archive-file> <path-to-files> ...

Add the -z option to generate a compressed archive file.

tar -zcvf <name-for-archive-file> <path-to-files> ...

If a specific compression method is needed (such as pigz to compress files in parallel), use the --use-compress-program option instead.

tar --use-compress-program=<compression-method-name> -cvf <name-for-archive-file> <path-to-files> ...

To later extract the files from an archive file, run the following command. Add the -z option for compressed archive files.

tar -xvf <name-for-archive-file>
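
To check what an archive contains without extracting it, you can also list its contents with the -t option.

tar -tvf <name-for-archive-file>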


Suppose I have a 500 GB collection of files in a directory research-data/ where each file is less than 1 GB. I can create one 500 GB archive file research-data.tar by running the following command.

tar -cvf research-data.tar research-data/

If the files were instead spread across multiple directories (research-data-1/ and research-data-2/), I could archive both of them together by running the following command.

tar -cvf research-data.tar research-data-1/ research-data-2/

If I want to instead generate a compressed archive file research-data.tgz, I would run the following command.

tar -zcvf research-data.tgz research-data/

Alternatively, if there were many files in research-data/ and I wanted to use the pigz compression method to compress files in parallel and speed up the process, I would run the following command.

tar --use-compress-program=pigz -cvf research-data.tgz research-data/
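
Because pigz produces standard gzip-format output, an archive created this way can still be extracted with the usual -z option. For the example above, that would look like the following.

tar -xzvf research-data.tgz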

split/cat

split is another standard Linux command-line tool, used to split a large file into multiple smaller files. This is useful if the data you plan to archive contains individual files larger than the 1 TB limit.


To split a file into multiple files with an upper limit on their size, run split with the -b option.

split -b <size> <file> <prefix-for-output-files>

To instead split a file into a specific number of equally-sized files, run split with the -n option.

split -n <number> <file> <prefix-for-output-files>


Suppose I have a 2 TB file data.csv. If I want to split it up into 200 GB files, I would run the following command.

split -b 200GB data.csv data.csv.

The command would result in 10 new files: data.csv.aa through data.csv.aj.

Alternatively, I could have done the same thing by specifying the number of files instead.

split -n 10 data.csv data.csv.



If you want, you can change the suffixes of the generated files from letters to numbers with the -d option. If I ran the command with that flag for the previous example, the generated files would instead be data.csv.00 through data.csv.09.

split -d -b 200GB data.csv data.csv.


To combine the split files back into a single file, run the cat command on them to concatenate their contents, then redirect the output of cat to a file.

cat data.csv.* > data.csv
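
If you want an extra check that the reassembled file matches the original, you can record a checksum before splitting and verify it after recombining. The commands below are a sketch using sha256sum and the data.csv example above.

sha256sum data.csv > data.csv.sha256
sha256sum -c data.csv.sha256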




For a collection that includes both files too large and files too small to store on OURRstore, a good way to prepare it is to use tar to gather everything into a single archive file and then split that archive file.

Suppose I had a 2 TB directory research-data/ containing such a mix of files. I could run the following command to prepare them for OURRstore.

tar -czvf research-data.tgz research-data/ && split -b 200GB research-data.tgz research-data.tgz.
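
To restore the original directory after retrieving the pieces from OURRstore, reverse the process: concatenate the pieces back into one archive with cat, then extract it with tar. The command below assumes the file names from the example above.

cat research-data.tgz.* > research-data.tgz && tar -xzvf research-data.tgz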

