Data Archival
The HPC team is updating this page. Check back for new information.
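Compressibility By File Type
File types with an average ratio ≥0.80 gain little from compression. Avoid compressing them individually; compress them only when they are part of a directory that is being archived as a whole.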
Category | Type | Extension | Avg Ratio (compressed size/original size) |
---|---|---|---|
Programs | binary | <null>/.bin/.exe | 0.25-1+ |
Programs | java binary | .jar | 0.75-0.90 |
| compressed files | .gzip/.zip/.bz2 | >1 |
Text related files | fonts | .ttf | 0.46-0.71 |
Text related files | txt | <null>/.txt | 0.32-0.55 |
Text related files | docx | .docx | 0.80-0.85 |
Text related files | source files | .c/.cpp/.h/.java/.js/.py/.html/.css/.hpp/.lua | 0.10-0.45 |
Text related files | log files | .log | 0.05-0.25 |
Text related files | | | 0.50-0.95 |
| library files | .so | 0.25-0.45 |
| data files | .json/.xml | 0.30-0.60 |
| audio files | .mp3/.ogg/.mp4/.wav | 0.80-0.95* |
Image related files | image files | .jpg/.jpeg/.png | 0.93-1+ |
Image related files | svg | .svg | 0.30-0.57 |
Image related files | gif | .gif | 0.80-0.95 |
* Certain types of .wav files can compress very well.
To determine file type:
% ls -l
-rw-r--r-- 1 root root 2625604 Jun 15 2022 mstflint-4.16.0-1.53100.x86_64.rpm
-rwx------ 1 root root 1415 Mar 16 10:35 weka_install.sh
% file mstflint-4.16.0-1.53100.x86_64.rpm
mstflint-4.16.0-1.53100.x86_64.rpm: RPM v3.0 bin i386/x86_64 mstflint-4.16.0-1.53100
% file weka_install.sh
weka_install.sh: POSIX shell script, ASCII text executable
Considerations
Not all data compresses equally. Depending on the contents of the file or directory, the space savings might be small, or the compression overhead might even make the result larger than the original.
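To check how a particular file or directory behaves before archiving it, you can measure the ratio yourself. The following is a minimal sketch, assuming gzip at its default compression level; `compression_ratio` is a hypothetical helper name used here for illustration, not a site-provided command.

```shell
#!/bin/sh
# compression_ratio PATH
# Prints (compressed size / original size) for a file or directory.
# NOTE: compression_ratio is a hypothetical helper, not a system tool.
# Nothing is written to disk; the data is only streamed through gzip.
compression_ratio() {
    if [ -d "$1" ]; then
        # For a directory, measure the size of the tar stream
        orig=$(( $(tar cf - "$1" | wc -c) ))
        comp=$(( $(tar cf - "$1" | gzip -c | wc -c) ))
    else
        orig=$(( $(wc -c < "$1") ))
        comp=$(( $(gzip -c "$1" | wc -c) ))
    fi

    # An empty file has no meaningful ratio
    if [ "$orig" -eq 0 ]; then
        echo "n/a"
        return 0
    fi

    # awk performs the floating-point division
    awk -v c="$comp" -v o="$orig" 'BEGIN { printf "%.2f\n", c / o }'
}
```

For example, `compression_ratio data1` prints a number comparable to the Avg Ratio column in the table above. A result at or above 0.80 suggests that compressing that data will save little space.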
Preparing Your Data For Archival
If your data contains many compressible files, you should tar and compress them into one or more tarballs and store the tarballs. Although it involves an extra step, this reduces the space used and makes transfers faster and easier, because the system handles a few tarballs much more easily than many small files.
Here's an example of using tar.
Suppose your data is in 3 directories. You may find it convenient to create a tarball for each directory, as shown in this example:
# List directory
% ls -l
drwxr-xr-x 5 aaa0000 Domain_Users 4096 Jun 22 2018 data1
drwxr-xr-x 5 aaa0000 Domain_Users 4096 Jun 22 2018 data2
drwxr-xr-x 5 aaa0000 Domain_Users 4096 Jun 22 2018 data3
# Make compressed tarballs directly
% tar czf data1.tgz data1
% tar czf data2.tgz data2
% tar czf data3.tgz data3
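# Or, as a two-step alternative, make uncompressed tarballs first
% tar cf data1.tar data1
% tar cf data2.tar data2
% tar cf data3.tar data3
# Then compress the tarballs (produces data1.tar.gz, data2.tar.gz, data3.tar.gz)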
% gzip data1.tar
% gzip data2.tar
% gzip data3.tar
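If you have more than a handful of directories, the one-step form can be wrapped in a loop. This is a sketch only; `make_tarballs` is a hypothetical function name, and it assumes each tarball should be written next to its directory under the name `<directory>.tgz`.

```shell
#!/bin/sh
# make_tarballs DIR...
# Creates DIR.tgz for each directory given on the command line.
# NOTE: make_tarballs is a hypothetical helper, not a system tool.
make_tarballs() {
    for dir in "$@"; do
        # Drop a trailing slash so "data1/" yields data1.tgz
        name=${dir%/}
        if [ ! -d "$name" ]; then
            echo "skipping $name: not a directory" >&2
            continue
        fi
        # Same command as the one-step example: create, gzip, write to file
        tar czf "$name.tgz" "$name" || return 1
    done
    return 0
}
```

With the directories from the example, this would be called as `make_tarballs data1 data2 data3`.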
It is VERY IMPORTANT that you remove the original copy after tarring/compressing the file/directory.
# Delete the original copy; otherwise you've just doubled the storage used
% rm -rf data1
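Because `rm -rf` cannot be undone, it may be worth confirming that the tarball is readable before deleting the original. The sketch below is one possible check, not an official procedure; `remove_if_archived` is a hypothetical function name. It assumes the tarball was created from the directory with `tar czf` as in the example, and that no file names contain newlines.

```shell
#!/bin/sh
# remove_if_archived TARBALL DIR
# Deletes DIR only if TARBALL can be read in full and lists as many
# entries as DIR contains.
# NOTE: remove_if_archived is a hypothetical helper, not a system tool.
remove_if_archived() {
    tarball=$1
    dir=${2%/}

    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi

    # Listing the archive forces tar to decompress all of it, so a
    # truncated or corrupt tarball fails at this step.
    listing=$(tar tzf "$tarball") || {
        echo "cannot read $tarball; keeping $dir" >&2
        return 1
    }

    # Compare entry counts (files and directories) in the archive and on disk
    archived=$(( $(printf '%s\n' "$listing" | wc -l) ))
    on_disk=$(( $(find "$dir" | wc -l) ))
    if [ "$archived" -ne "$on_disk" ]; then
        echo "entry count differs ($archived vs $on_disk); keeping $dir" >&2
        return 1
    fi

    rm -rf -- "$dir"
}
```

For the example above, `remove_if_archived data1.tgz data1` would replace the bare `rm -rf data1`. Matching entry counts do not prove that the contents are identical, so treat this as a sanity check rather than a full verification.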
Archiving Your Data
Once your data is compressed, it is time to move it to the archive folder. The archive folder is only available for data on shared storage. Moving the data isn’t strictly necessary, but it helps with file organization.
Viewing And Retrieving Archived Data
If you need to view or retrieve data that has been archived.