Thursday, October 26, 2023

Geo JSON compression: 7z vs Zip

When Geo data is stored in JSON format, there is usually a lot of redundancy.
Many Geo items, e.g. those extracted from photo EXIF, have the same or similar values,
so they "should" be very compressible.
When they are saved as separate, small JSON files, however, standard ZIP is very ineffective.
Example:

2,482 Geo JSON Files
Size: 4.15 MB (4,355,916 bytes)
Size on disk: 9.69 MB (10,166,272 bytes)

(Windows) zip: 2.17 MB file
So while the archive is about 4.5 times smaller than the space the files occupy on disk, the effective reduction relative to the actual data size is only about 2x.

With the 7-Zip program, the resulting file is AMAZINGLY small: a 100 KB 7z file.
That is about 20 times smaller than the Zip, about 40 times smaller than the original data, and about 100 times smaller than the original size on disk!
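
The ratios quoted above can be checked with a few lines of Node.js. This is only a sketch of the arithmetic: the byte counts come from the figures in this post, the zip and 7z sizes are the rounded values (2.17 MB and 100 KB), and the names `sizes` and `ratio` are made up for illustration.

```javascript
// Sizes taken from the figures quoted in the post.
const KB = 1024;
const MB = 1024 * KB;

const sizes = {
  raw: 4355916,        // "Size" of the 2,482 JSON files, in bytes
  onDisk: 10166272,    // "Size on disk", in bytes
  zip: 2.17 * MB,      // Windows zip archive (rounded figure)
  sevenZip: 100 * KB,  // 7z archive (rounded figure)
};

// How many times smaller `after` is compared to `before`.
function ratio(before, after) {
  return before / after;
}

const ratios = {
  zipVsRaw: ratio(sizes.raw, sizes.zip),
  zipVsDisk: ratio(sizes.onDisk, sizes.zip),
  sevenZipVsZip: ratio(sizes.zip, sizes.sevenZip),
  sevenZipVsRaw: ratio(sizes.raw, sizes.sevenZip),
  sevenZipVsDisk: ratio(sizes.onDisk, sizes.sevenZip),
};

for (const [name, value] of Object.entries(ratios)) {
  console.log(name.padEnd(16), value.toFixed(1) + 'x');
}
```

With these inputs the zip comes out roughly 2x smaller than the raw data, and the 7z file roughly 22x smaller than the zip, 43x smaller than the raw data and 99x smaller than the size on disk.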

Now, having many small files is not optimal to begin with. When they are "merged" into a single
JSON file (an object using the file name as key), the size of that single JSON file is 3.37 MB.
The (Windows) Zip of this single JSON file is now very small: 113 KB.
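
The "merge" step could look something like the sketch below. It is not the exact script used for the numbers above; the function name `mergeJsonFiles` and the folder and output names in the usage comment are hypothetical.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Merge every *.json file in `dir` into one object keyed by file name.
function mergeJsonFiles(dir) {
  const merged = {};
  const names = fs.readdirSync(dir).filter((n) => n.endsWith('.json')).sort();
  for (const name of names) {
    const text = fs.readFileSync(path.join(dir, name), 'utf8');
    merged[name] = JSON.parse(text);
  }
  return merged;
}

// Hypothetical usage: merge the "geo" folder into "geo-all.json".
// const merged = mergeJsonFiles('geo');
// fs.writeFileSync('geo-all.json', JSON.stringify(merged));
```

Writing the result with plain JSON.stringify (no indentation) also drops whitespace, which may account for part of the difference between the total size of the small files and the size of the merged file.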

Download 7zip

The main difference, besides the higher compression of 7zip,
is that 7zip apparently uses the same shared "dictionary" across files (a so-called "solid" archive),
while zip compresses each file separately.
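
The effect of a shared dictionary can be illustrated without 7z at all. The sketch below uses Node's built-in zlib (deflate, not LZMA) on made-up, EXIF-like records: once compressing each record on its own, as zip does per file, and once compressing all records as a single stream. The record fields and values are invented for the demo, and the absolute numbers will not match the 7z results above.

```javascript
const zlib = require('node:zlib');

// Build synthetic records that differ only slightly (made-up data).
function makeRecords(count) {
  const records = [];
  for (let i = 0; i < count; i++) {
    records.push(JSON.stringify({
      file: `photo_${i}.jpg`,
      latitude: Number((40.7 + i * 0.0001).toFixed(4)),
      longitude: Number((-74.0 - i * 0.0001).toFixed(4)),
      camera: 'ExampleCam 1000',
      altitudeRef: 'Above sea level',
    }));
  }
  return records;
}

const records = makeRecords(500);

// "Zip-like": each record compressed on its own, sizes added up.
const separateBytes = records
  .map((r) => zlib.deflateSync(Buffer.from(r)).length)
  .reduce((sum, n) => sum + n, 0);

// "Solid-like": all records compressed as one stream.
const solidBytes = zlib.deflateSync(Buffer.from(records.join('\n'))).length;

console.log({ separateBytes, solidBytes });
```

Compressed separately, each record has to spell out the repeated keys and values again. In the single stream they become short back-references to earlier records.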

Ideally, it should be possible to add individual files to the archive and extract them one by one.
That way the archive could effectively be used as a simple "database" of compressed files.


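E.g. to add a JSON file to an archive:

7z a -t7z archive.7z newfile.json

and to extract a single file into an output directory (outputdir is a placeholder; there must be no space between -o and the directory name):

7z e archive.7z -ooutputdir file.json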

In PowerShell, put the "&" call operator before the quoted path to 7z.exe:


& "C:\Program Files\7-Zip\7z.exe" a -t7z geo.7z geo\20230818_153424.jpg.json


Using 7z from a node.js program:


7-Zip precompiled binaries.


7zip - npm
a lite-version of 7zip, ≈2.4MB.
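var spawn = require('child_process').spawn         // child_process.spawn runs the 7z binary
var _7z = require('7zip')['7z']                    // path to the 7z binary bundled with the package
var task = spawn(_7z, ['x', 'somefile.7z', '-y'])  // extract with full paths, answer "yes" to all prompts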
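
Building on the same idea, here is a minimal sketch of a wrapper for the "archive as a simple database" use case: one helper to add a file and one to extract a file. It assumes a `7z` executable is available on the PATH (pass a different path otherwise, e.g. the one from the npm package). The helper names `buildAddArgs`, `buildExtractArgs` and `run7z` are made up for this example. The switches used (a, e, -t7z, -o, -y) are the same ones shown in the commands earlier in this post.

```javascript
const { spawn } = require('node:child_process');

// Assumption: a `7z` executable is on the PATH; pass another path if not.
const DEFAULT_7Z = '7z';

// Arguments for adding (or updating) one file in a 7z archive.
function buildAddArgs(archive, file) {
  return ['a', '-t7z', archive, file];
}

// Arguments for extracting one file into outDir (note: no space after -o).
function buildExtractArgs(archive, outDir, file) {
  return ['e', archive, `-o${outDir}`, file, '-y'];
}

// Run 7z and resolve when it exits with code 0.
function run7z(args, exe = DEFAULT_7Z) {
  return new Promise((resolve, reject) => {
    const task = spawn(exe, args, { stdio: 'inherit' });
    task.on('error', reject); // e.g. executable not found
    task.on('close', (code) =>
      code === 0 ? resolve() : reject(new Error(`7z exited with code ${code}`)));
  });
}

// Hypothetical usage:
// run7z(buildAddArgs('geo.7z', 'geo/20230818_153424.jpg.json'))
//   .then(() => run7z(buildExtractArgs('geo.7z', 'out', 'geo/20230818_153424.jpg.json')));
```

Keeping the argument lists in separate functions makes them easy to check without actually running 7z.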




The Lempel–Ziv–Markov chain algorithm (LZMA) is an algorithm used to perform lossless data compression. It has been under development since either 1996 or 1998 by Igor Pavlov[1] and was first used in the 7z format of the 7-Zip archiver. This algorithm uses a dictionary compression scheme somewhat similar to the LZ77 algorithm published by Abraham Lempel and Jacob Ziv in 1977 and features a high compression ratio (generally higher than bzip2)[2][3] and a variable compression-dictionary size (up to 4 GB),[4] while still maintaining decompression speed similar to other commonly used compression algorithms.



