The HPC team is updating this page. Check back for new information.

Compressibility By File Type

Avoid compressing file types with an average ratio of ≥0.80 unless they are part of a directory that is being compressed as a whole.

| Type | Extension | Avg Ratio (compressed size / original size) |
|---|---|---|
| binary | <null>/.bin/.exe | 0.25-1+ |
| java binary | .jar | 0.75-0.90 |
| docx | .docx | 0.80-0.85 |
| gif | .gif | 0.80-0.95 |
| compressed files | .gzip/.zip/.bz2 | >1 |
| image files | .jpg/.jpeg/.png | 0.93-1+ |
| data files | .json/.xml | 0.30-0.60 |
| audio files | .mp3/.ogg/.mp4 | 0.80-0.95 |
| pdf | .pdf | 0.50-0.95 |
| svg | .svg | 0.30-0.57 |
| fonts | .ttf | 0.46-0.71 |
| txt | <null>/.txt | 0.32-0.55 |
| wav | .wav | 0.45-0.95 |
| source files | .c/.cpp/.h/.java/.js/.py/.html/.css/.hpp/.lua | 0.10-0.45 |
| library files | .so | 0.25-0.45 |
| log files | .log | 0.05-0.25 |

Source

To determine file type:

% ls -l
-rw-r--r-- 1 root root 2625604 Jun 15  2022 mstflint-4.16.0-1.53100.x86_64.rpm
-rwx------ 1 root root    1415 Mar 16 10:35 weka_install.sh

% file mstflint-4.16.0-1.53100.x86_64.rpm
mstflint-4.16.0-1.53100.x86_64.rpm: RPM v3.0 bin i386/x86_64 mstflint-4.16.0-1.53100

% file weka_install.sh
weka_install.sh: POSIX shell script, ASCII text executable
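
When a directory holds many files, checking them one at a time is slow. The following is a rough sketch, not an official HPC tool, that tallies regular files by extension so the counts can be compared against the ratios listed above. The function name `count_extensions` is hypothetical, and the demo files are throwaway examples.

```shell
# count_extensions DIR: tally regular files under DIR by extension.
# Files without a dot in their name are reported as <null>.
# A dotfile such as .bashrc is counted as extension ".bashrc".
count_extensions() {
    find "$1" -type f |
        awk -F/ '{
            n = split($NF, parts, ".")      # $NF is the file name without its path
            ext = (n > 1) ? "." parts[n] : "<null>"
            count[ext]++
        }
        END { for (e in count) print count[e], e }' |
        sort -rn
}

# Demo on throwaway files
demo=$(mktemp -d)
touch "$demo/a.log" "$demo/b.log" "$demo/c.json" "$demo/README"
count_extensions "$demo"
rm -rf "$demo"
```

File names that contain newlines are not handled; for a quick survey of a project directory that is usually acceptable.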

Considerations

Not all data compresses equally. Depending on the contents of the file or directory, the space savings may be small, or the compressed file may even be larger than the original because of compression overhead.
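
One way to check this before committing to a large job is to measure the ratio on a sample file. The sketch below assumes `gzip` is the compressor and prints compressed size divided by original size, the same definition the table uses. The function name `compression_ratio` and the sample file are illustrative only.

```shell
# compression_ratio FILE: print (gzip-compressed size / original size).
# The body runs in a subshell, so its variables do not leak to the caller.
compression_ratio() (
    orig=$(wc -c < "$1")
    comp=$(gzip -c < "$1" | wc -c)
    awk -v o="$orig" -v c="$comp" \
        'BEGIN { if (o == 0) print "n/a"; else printf "%.2f\n", c / o }'
)

# Demo: a repetitive, log-like sample file
sample=$(mktemp)
awk 'BEGIN { for (i = 0; i < 1000; i++) print "2024-01-01 INFO job finished" }' > "$sample"
compression_ratio "$sample"
rm -f "$sample"
```

A value close to or above 1 suggests the file is not worth compressing on its own.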

Preparing Your Data For Archival

If your data contains many compressible files, tar and compress them into one or more tarballs and store the tarballs. Although this involves an extra step, it reduces the space used and makes transfers faster and easier, because the system handles a few tarballs much more efficiently than many small files.

Here's an example of using tar.

Suppose your data is in 3 directories. You may find it convenient to create a tarball for each directory, as shown in this example:

# List directory
% ls -l
drwxr-xr-x  5 aaa0000 Domain_Users       4096 Jun 22  2018 data1
drwxr-xr-x  5 aaa0000 Domain_Users       4096 Jun 22  2018 data2
drwxr-xr-x  5 aaa0000 Domain_Users       4096 Jun 22  2018 data3

# Make compressed tarballs directly
% tar czf data1.tgz  data1
% tar czf data2.tgz  data2
% tar czf data3.tgz  data3

# Alternatively, make uncompressed tarballs first (note: no "z" flag)
% tar cf data1.tar  data1
% tar cf data2.tar  data2
% tar cf data3.tar  data3

# Then compress the tarballs (produces data1.tar.gz, etc.)
% gzip data1.tar
% gzip data2.tar
% gzip data3.tar
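
With more than a handful of directories, the per-directory commands above can be folded into a loop. This is a minimal sketch; the function name `make_tarballs` is hypothetical, and the demo uses empty placeholder directories in a temporary location.

```shell
# make_tarballs DIR...: create DIR.tgz next to each directory argument.
# Stops at the first failure so a broken tarball is not overlooked.
make_tarballs() {
    for dir in "$@"; do
        dir=${dir%/}                      # drop a trailing slash, if any
        tar czf "$dir.tgz" "$dir" || return 1
        echo "created $dir.tgz"
    done
}

# Demo in a throwaway location
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    mkdir data1 data2 data3
    make_tarballs data1 data2 data3
)
rm -rf "$demo"
```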

It is VERY IMPORTANT that you remove the original copy after tarring/compressing the file or directory, once you have confirmed that the tarball was created successfully.

# Delete the original copy otherwise you've just doubled the storage used
% rm -rf data1
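
Because `rm -rf` cannot be undone, it may be worth checking the tarball first. The sketch below is one possible safeguard, not a required procedure: it assumes the tarball is named `<directory>.tgz`, tests the gzip stream, and compares the number of archive entries with the number of paths in the directory. The function name `remove_if_archived` is hypothetical, and the entry-count comparison is only a heuristic (it does not compare file contents).

```shell
# remove_if_archived DIR: delete DIR only if DIR.tgz passes basic checks.
# The body runs in a subshell, so "exit" only leaves the function.
remove_if_archived() (
    dir=${1%/}
    tarball="$dir.tgz"
    [ -f "$tarball" ] || { echo "missing $tarball" >&2; exit 1; }
    gzip -t "$tarball" || exit 1                 # gzip stream is intact
    entries=$(tar tzf "$tarball" | wc -l)        # entries in the archive
    paths=$(find "$dir" | wc -l)                 # files and directories on disk
    [ "$entries" -eq "$paths" ] || { echo "entry count mismatch for $dir" >&2; exit 1; }
    rm -rf "$dir"
)

# Demo in a throwaway location
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    mkdir -p data1/run
    echo "result" > data1/run/out.txt
    tar czf data1.tgz data1
    remove_if_archived data1 && echo "data1 removed, data1.tgz kept"
)
rm -rf "$demo"
```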

Archiving Your Data

Once your data is compressed, move it to the archive folder. The archive folder is only available for data on shared storage. This step isn’t strictly necessary, but it helps with file organization.

# List directory
% ls -l
-rw-r--r--  1 aaa0000 Domain_Users       409 Jun 22  2018 data1.tgz
-rw-r--r--  1 aaa0000 Domain_Users       409 Jun 22  2018 data2.tgz
-rw-r--r--  1 aaa0000 Domain_Users       409 Jun 22  2018 data3.tgz

# Move the compressed tarballs
% mv data1.tgz /gpfs/sharedfs1/<PI_NetID>/ARCHIVE/<Your_NetID>
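
If you want extra assurance that a tarball arrived intact, a copy-verify-delete sequence can be used in place of a plain `mv`. This is a sketch under assumptions: the function name `archive_move` is hypothetical, the destination directory is expected to exist already, and POSIX `cksum` is used as a simple integrity check. The demo uses temporary directories rather than the real archive path.

```shell
# archive_move TARBALL DEST_DIR: copy, compare checksums, then remove the source.
# The body runs in a subshell, so "exit" only leaves the function.
archive_move() (
    src=$1
    dest_dir=$2
    [ -f "$src" ] || { echo "no such file: $src" >&2; exit 1; }
    [ -d "$dest_dir" ] || { echo "no such directory: $dest_dir" >&2; exit 1; }
    dest="$dest_dir/$(basename "$src")"
    cp "$src" "$dest" || exit 1
    # Reading from stdin makes cksum print only "<crc> <size>", without a name
    src_sum=$(cksum < "$src")
    dest_sum=$(cksum < "$dest")
    [ "$src_sum" = "$dest_sum" ] || { echo "checksum mismatch for $src" >&2; exit 1; }
    rm -f "$src"
)

# Demo with temporary stand-ins for the working and archive directories
demo=$(mktemp -d)
mkdir "$demo/work" "$demo/archive"
echo "placeholder contents" > "$demo/work/data1.tgz"
archive_move "$demo/work/data1.tgz" "$demo/archive" && ls "$demo/archive"
rm -rf "$demo"
```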

Viewing And Retrieving Archived Data

To view or retrieve data that has been archived, list the contents of the tarball and extract what you need:

# List directory
% ls -l
-rw-r--r--  1 aaa0000 Domain_Users       409 Jun 22  2018 data1.tgz
-rw-r--r--  1 aaa0000 Domain_Users       409 Jun 22  2018 data2.tgz
-rw-r--r--  1 aaa0000 Domain_Users       409 Jun 22  2018 data3.tgz

# View contents of compressed tarball
% tar tzf data1.tgz

# Extract a single file or directory
% tar xzf data1.tgz [-C /path/to/destination/directory] <internal_tar_file_path/to/file-or-directory>
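
As a concrete illustration of the general form above, the following sketch builds a small tarball in a temporary directory, lists it, and extracts a single file into a separate destination. All file and directory names here (`results/run1.txt`, `notes.txt`, `restored`) are made up for the example.

```shell
# Build a small example tarball in a throwaway location
work=$(mktemp -d)
(
    cd "$work" || exit 1
    mkdir -p data1/results restored
    echo "alpha" > data1/results/run1.txt
    echo "beta" > data1/notes.txt
    tar czf data1.tgz data1

    # View the contents; member paths are shown relative to the tarball root
    tar tzf data1.tgz

    # Extract one file into ./restored, using the path exactly as listed
    tar xzf data1.tgz -C restored data1/results/run1.txt
)
```

The extracted file ends up at `restored/data1/results/run1.txt` because tar recreates the member's full internal path under the destination directory.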