Data Archival

The HPC team is updating this page. Check back for new information.

Compressibility By File Type

Avoid compressing file types with an average ratio of ≥0.80, unless the files are part of a directory that is being archived as a whole.

| Category | Type | Extension | Avg Ratio (compressed size/original size) |
|---|---|---|---|
| Programs | binary | <null>/.bin/.exe | 0.25-1+ |
| Programs | java binary | .jar | 0.75-0.90 |
| | compressed files | .gzip/.zip/.bz2 | >1 |
| Text related files | fonts | .ttf | 0.46-0.71 |
| Text related files | txt | <null>/.txt | 0.32-0.55 |
| Text related files | docx | .docx | 0.80-0.85 |
| Text related files | source files | .c/.cpp/.h/.java/.js/.py/.html/.css/.hpp/.lua | 0.10-0.45 |
| Text related files | log files | .log | 0.05-0.25 |
| Text related files | pdf | .pdf | 0.50-0.95 |
| | library files | .so | 0.25-0.45 |
| | data files | .json/.xml | 0.30-0.60 |
| | audio files | .mp3/.ogg/.mp4/.wav | 0.80-0.95* |
| Image related files | image files | .jpg/.jpeg/.png | 0.93-1+ |
| Image related files | svg | .svg | 0.30-0.57 |
| Image related files | gif | .gif | 0.80-0.95 |

* certain types of .wav files can compress very well

Source

To determine file type:

    % ls -l
    -rw-r--r-- 1 root root 2625604 Jun 15 2022 mstflint-4.16.0-1.53100.x86_64.rpm
    -rwx------ 1 root root 1415 Mar 16 10:35 weka_install.sh
    % file mstflint-4.16.0-1.53100.x86_64.rpm
    mstflint-4.16.0-1.53100.x86_64.rpm: RPM v3.0 bin i386/x86_64 mstflint-4.16.0-1.53100
    % file weka_install.sh
    weka_install.sh: POSIX shell script, ASCII text executable
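To get an overview of a whole directory rather than a single file, the files can be tallied by extension and compared against the ratios listed for each extension. The following is a minimal sketch, not a site-provided tool: `count_by_extension` is a hypothetical helper name, and it assumes file names that contain no newline characters.

```shell
# Count regular files by extension under a directory (default: current directory).
# Extensions are lowercased; files without an extension are reported as "<none>".
count_by_extension() {
    find "${1:-.}" -type f | awk -F/ '
        {
            name = $NF    # last path component, i.e. the file name
            # A leading dot (e.g. ".bashrc") is not treated as an extension
            if (match(name, /\.[^.]+$/) && RSTART > 1)
                ext = tolower(substr(name, RSTART))
            else
                ext = "<none>"
            count[ext]++
        }
        END { for (e in count) printf "%d %s\n", count[e], e }
    ' | sort -rn
}
```

Running `count_by_extension data1` prints one line per extension, most frequent first. A directory dominated by extensions such as .log or .txt is likely to compress well, while one dominated by .jpg or .zip is not.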

Considerations

Not all data compresses equally. Depending on the contents of the file or directory, the space savings might be small, or compression might even make the file larger because of compression overhead.
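One way to check this before archiving is to measure the ratio (compressed size/original size) of a representative file. The sketch below assumes gzip as the compressor, and `compression_ratio` is a hypothetical helper name. Other compressors will give different numbers.

```shell
# Print the gzip compression ratio (compressed size / original size) of one file.
# The file is compressed to a pipe, so nothing is written to disk.
compression_ratio() {
    orig=$(wc -c < "$1" | tr -d ' ')
    comp=$(gzip -c "$1" | wc -c | tr -d ' ')
    if [ "$orig" -eq 0 ]; then
        echo "n/a (empty file)"
    else
        awk -v c="$comp" -v o="$orig" 'BEGIN { printf "%.2f\n", c / o }'
    fi
}
```

For example, `compression_ratio results.log` (a hypothetical file name) prints a single number. A value of 0.80 or higher suggests that compressing that kind of file saves little space.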

Preparing Your Data For Archival

If your data contains many compressible files, combine and compress them into one or more tarballs with tar, and store the tarballs. Although this adds an extra step, it reduces the space used and makes transfers faster and easier, because the system handles a few tarballs much more easily than many small files.

Here's an example of using tar.

Suppose your data is in three directories. You may find it convenient to create a tarball for each directory, as shown in this example:

    # List directory
    % ls -l
    drwxr-xr-x 5 aaa0000 Domain_Users 4096 Jun 22 2018 data1
    drwxr-xr-x 5 aaa0000 Domain_Users 4096 Jun 22 2018 data2
    drwxr-xr-x 5 aaa0000 Domain_Users 4096 Jun 22 2018 data3

    # Option 1: make compressed tarballs directly
    % tar czf data1.tgz data1
    % tar czf data2.tgz data2
    % tar czf data3.tgz data3

    # Option 2: make uncompressed tarballs (cf, without z) ...
    % tar cf data1.tar data1
    % tar cf data2.tar data2
    % tar cf data3.tar data3

    # ... then compress them; gzip replaces each .tar with a .tar.gz
    % gzip data1.tar
    % gzip data2.tar
    % gzip data3.tar

It is VERY IMPORTANT that you remove the original copy after tarring and compressing the file or directory.

    # Delete the original copy; otherwise you've just doubled the storage used
    % rm -rf data1
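Because `rm -rf` cannot be undone, it may be worth confirming that the tarball is readable before deleting the original. The following sketch combines the steps. `archive_and_remove` is a hypothetical helper name, and the checks shown (`gzip -t` and a full `tar tzf` listing) only confirm that the archive can be read, not that every file matches the original byte for byte.

```shell
# Create a compressed tarball of a directory, verify it, and only then delete the original.
archive_and_remove() {
    dir=${1%/}              # strip a trailing slash, if any
    tarball="$dir.tgz"
    tar czf "$tarball" "$dir" || return 1
    # gzip -t tests the compressed stream; tar tzf reads the archive end to end
    if gzip -t "$tarball" && tar tzf "$tarball" > /dev/null; then
        rm -rf "$dir"
    else
        echo "verification failed; keeping $dir" >&2
        return 1
    fi
}
```

Calling `archive_and_remove data1` from the parent directory leaves `data1.tgz` in place of `data1`. If either check fails, the original directory is kept.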

Archiving Your Data

Once your data is compressed, it is time to move it to the archive folder. The archive folder is only available for data on shared storage. This step isn’t strictly necessary, but it helps with file organization.

Viewing And Retrieving Archived Data

If you need to view or retrieve data that has been archived.
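For tarballs created with `tar czf` as described in the preparation step, both tasks can be done with tar itself. This is a minimal sketch that assumes the tarball is reachable by an ordinary file path. `list_archive` and `extract_archive` are hypothetical helper names, not site-provided commands.

```shell
# List what a compressed tarball contains, without extracting anything.
list_archive() {
    tar tzf "$1"
}

# Extract a tarball (or only the named members) into a destination directory.
# Usage: extract_archive TARBALL DESTDIR [MEMBER...]
extract_archive() {
    tarball=$1
    dest=$2
    shift 2
    # With no MEMBER arguments, "$@" is empty and everything is extracted
    mkdir -p "$dest" && tar xzf "$tarball" -C "$dest" "$@"
}
```

For example, `list_archive data1.tgz` shows the stored paths, and `extract_archive data1.tgz restored data1/a.txt` would retrieve a single file, where `data1/a.txt` stands for a path copied exactly from that listing. Extracting only the members you need avoids unpacking the whole archive.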