Poor compression of >20GB exe/msi/cab sample


  • Alpha Testers

    A poor PA compression ratio was reported:

    • Uncompressed: 21,657,900,590 bytes
    • 7-Zip (your package): 2,662,732,158 bytes
    • PowerArchiver 17.00.90 (Optimize Strong): 3,398,179,937 bytes

    Find package at: https://mega.nz/#!0aRDiAKQ!lrwtC64jnkk4d0ZKjcVGgLKPCcOqyUSyAQ62JJtQZOM


  • conexware

    @nikkho what kind of files are they? Downloading, but it will take a while for me since I am on vacation, poor connection. Thanks.

    Also, you can post a log of PA compression; let's see what's being applied.


  • conexware

    moved the thread to the Advanced Codec Pack forums…



  • I am the original uploader of this package, per this parent thread:

    https://encode.ru/threads/130-PowerArchiver/page3

    We also use a 7-Zip modification internally at DiskZIP, and are open minded about working with vendors who are able to improve the compression ratio.

    DiskZIP also includes patent pending transparent disk compression which is a 100% DiskZIP exclusive.

    I am looking forward to hearing from you on whether you are able to improve the compression ratio for my dataset. Thank you!


  • conexware

    @diskzip one rule here is no advertising of other software… as long as we keep to that, no problem. You can also talk with eugene on compression related issues, and I am sure you know him from encode.ru.

    I will check this specific data sample; as far as I know it is some game data?



  • @spwolf Sure, no further ads 🙂

    I am not familiar with eugene, at least not by that handle.

    This is not game data but application data.


  • conexware

    @diskzip … shelwien on encode.ru.

    Seems like some multimedia data that might like some multimedia filter applied… I will take a look sometime next week to see where the difference is. Thanks.


  • Alpha Testers

    For one thing it looks to me like .msi and .msp files probably aren’t being compressed as well as they should be.
    EDIT: Though come to think of it 7zip won’t be doing anything better with them so I don’t know.


  • conexware

    See diskzip’s previous post about this - https://encode.ru/threads/?p=53578&pp=1
    PA can compress this set better, just not with default GUI settings or some such.
    My current theory is that diskzip compares single-threaded vs MT results here,
    where MT works by blockwise splitting of data.
    Basically, it's not enough to compare archive size here; paq8px would also compress better, so what.


  • conexware

    @eugene after a bit of testing, the original .7z uses a big 1536M dictionary and mt1 setting, which is the reason for the smaller file. By default, .pa uses 96M max, I believe. The regular 7z Ultra setting creates a 4.6GB file due to its 64M dictionary.

    I can't even test PA with the same settings on my laptop, since a 1.5G dictionary at mt1 takes around 18GB of RAM.

    There are a lot of similar and near-identical files, so huge dictionaries will always help, even if they are impractical for any kind of normal use.



  • I can test locally here, but I am having a hard time configuring PA for testing at maximum compression settings. Where and how can I configure PA for the best available compression settings?



  • @eugene said in Poor compression support:

    See diskzip’s previous post about this - https://encode.ru/threads/?p=53578&pp=1
    PA can compress this set better, just not with default GUI settings or some such.
    My current theory is that diskzip compares single-threaded vs MT results here,
    where MT works by blockwise splitting of data.
    Basically, it's not enough to compare archive size here; paq8px would also compress better, so what.

    DiskZIP's compression here uses exactly two threads.



  • @spwolf said in Poor compression support:

    @diskzip … shelwien on encode.ru.

    Seems like some multimedia data that might like some multimedia filter applied… I will take a look sometime next week to see where the difference is. Thanks.

    Not really multimedia at all. Primarily application binaries and pre-compressed application runtimes.


  • conexware

    @diskzip a lot of repeat data that works well with a 1.5GB dictionary… you can set plzma to 2000M in settings, and mt to 1, and let's see how it works.

    I tested with a 720m dictionary and that got it down another 400MB. But that's about the limit of my 12GB laptop.

    When it comes to testing, having a test case that requires 18GB-25GB of free RAM is just too hard and obscure. It would be better to have a sample that can use proper multithreading and a reasonable dictionary that users will actually end up using - for instance 128m and 8t.

    When it comes to comparing lzma2 to our plzma with the lzmarec entropy coder, you should see around a 2%-4% improvement, all other things being equal.

    For us, while lzmarec is nice and can always show an improvement over the same settings for lzma2, it is not the main point of the PA format… more important are all these other codecs - mp3, lepton/jpeg, reflate for pdf/docx/deflate, bwt for text, mt ppmd for some multimedia files, a deduplication filter for everything that runs at 50-60MB/s, etc, etc… and how it all works automatically and multithreaded.

    (settings screenshots attached)



  • @spwolf Is it necessary to restrict PA to only one thread? DiskZIP obtained this result on two threads, not one.

    How about other filters - do I need to override any of those settings as well, especially in light of the large number of binaries included in my distribution?



  • @spwolf said in Poor compression support:

    @diskzip a lot of repeat data that works well with a 1.5GB dictionary… you can set plzma to 2000M in settings, and mt to 1, and let's see how it works.

    I tested with a 720m dictionary and that got it down another 400MB. But that's about the limit of my 12GB laptop.

    When it comes to testing, having a test case that requires 18GB-25GB of free RAM is just too hard and obscure. It would be better to have a sample that can use proper multithreading and a reasonable dictionary that users will actually end up using - for instance 128m and 8t.

    When it comes to comparing lzma2 to our plzma with the lzmarec entropy coder, you should see around a 2%-4% improvement, all other things being equal.

    For us, while lzmarec is nice and can always show an improvement over the same settings for lzma2, it is not the main point of the PA format… more important are all these other codecs - mp3, lepton/jpeg, reflate for pdf/docx/deflate, bwt for text, mt ppmd for some multimedia files, a deduplication filter for everything that runs at 50-60MB/s, etc, etc… and how it all works automatically and multithreaded.

    (settings screenshots attached)

    So following the order of these instructions, the custom dictionary setting was lost. I had to repeat that step - glad I double-checked. Not the most intuitive UI, if you are open to a bit of negative feedback.

    Another negative tidbit: it took about 1 minute for the operation to initiate (for the compressing-files window to appear) after I clicked the Finish button.

    Not the best user experience really, but I am excited to see what actual compression savings will result.


  • conexware

    @diskzip yeah, I noticed I posted the wrong order but I figured you would figure it out… we have to reset the settings so users who enter wrong ones can go back to defaults, but otherwise users can easily save a profile with those settings and then always use that profile.


  • conexware

    @diskzip said in Poor compression support:

    @spwolf Is it necessary to restrict PA to only one thread? DiskZIP obtained this result on two threads, not one.

    How about other filters - do I need to override any of those settings as well, especially in light of the large number of binaries included in my distribution?

    no, you don't need to do anything else… you are actually using 7z.exe and lzma2, right? lzma2 uses 2 threads per dictionary when it comes to memory - so in this case it is 11.5 x 1536M. Plzma is different not only due to the different entropy coder, but also because it is a parallel version of lzma. So multiple threads are used for both compression and extraction. It also has a larger maximum dictionary at 2000M.

    Of course, even with mt1, there are multiple threads being used, depending on files, size, extension - for instance the lzmarec entropy coder uses more than 1 thread anyway, and we also always use some extra filters.

    In any case, what is the maximum dictionary you use in your product? I am sure it is not 1.5G, since that's 18GB of RAM usage?



  • @spwolf said in Poor compression support:

    @diskzip said in Poor compression support:

    @spwolf Is it necessary to restrict PA to only one thread? DiskZIP obtained this result on two threads, not one.

    How about other filters - do I need to override any of those settings as well, especially in light of the large number of binaries included in my distribution?

    no, you don't need to do anything else… you are actually using 7z.exe and lzma2, right? lzma2 uses 2 threads per dictionary when it comes to memory - so in this case it is 11.5 x 1536M. Plzma is different not only due to the different entropy coder, but also because it is a parallel version of lzma. So multiple threads are used for both compression and extraction. It also has a larger maximum dictionary at 2000M.

    Of course, even with mt1, there are multiple threads being used, depending on files, size, extension - for instance the lzmarec entropy coder uses more than 1 thread anyway, and we also always use some extra filters.

    In any case, what is the maximum dictionary you use in your product? I am sure it is not 1.5G, since that's 18GB of RAM usage?

    DiskZIP doesn’t invoke 7z.exe, we have our own low-level wrapper around 7-Zip; unlike PowerArchiver though, we don’t actually implement our own custom algorithm(s) or change the default 7-Zip compression in any way (other than exposing 7-Zip functionality in a nice, structured API with callbacks, etc.) - we also license this 7-Zip library to third parties for their use.

    The result with PA using your exact settings is 2.86 GB; I am at a loss to understand why PA has performed so poorly on this data set.

    Our dictionary is indeed exactly 1.5 GB - this is the current 7-Zip maximum (and even this already presents some problems with extraction on 32-bit systems due to memory fragmentation). It is LZMA2, of course, and with 2 threads.

    I may have misreported the memory requirements - but don’t blame me, blame the Windows Task Manager! I see it going up to 17.X GB (so cap it at 18 GB) with the 1.5 GB dictionary. With a 1 GB dictionary, it goes up to 10 GB (give or take a gigabyte).


  • conexware

    @diskzip interesting, I got 2.83G with a 720m dictionary… It just has a lot of similar files, so a large dictionary with lzma works wonders there. Doesn't seem like there is anything else to it.

    Memory usage is 11.5x the dictionary size for each 2-thread pair in the mt setting for lzma2.

    But how many users have the >=24GB required for such a setting, though?
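    The memory arithmetic quoted in this thread (~11.5x the dictionary per 2-thread lzma2 pair) can be sketched in plain Python; the multiplier is the rule of thumb from these posts, not an official 7-Zip formula.

```python
# Rough LZMA2 encoder memory estimate, using the ~11.5x-dictionary rule of
# thumb quoted above (it covers the window, bt4 match-finder tables, etc.).

def lzma2_memory_gb(dict_mb: int, multiplier: float = 11.5) -> float:
    """Approximate RAM (GB) for one 2-thread LZMA2 coder pair."""
    return dict_mb * multiplier / 1024

for d in (96, 720, 1024, 1536):
    print(f"{d:>4} MB dictionary -> ~{lzma2_memory_gb(d):.1f} GB RAM")
```

    At 1536M this lands around 17 GB, consistent with the ~18 GB of RAM reported above.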


  • Alpha Testers

    @nikkho said in Poor compression support:

    PowerArchiver 17.00.90 (Optimize Strong): 3,398,179,937 bytes

    I used a 1GB dictionary on PA, and the result was reduced to 2.79GB:

    • PowerArchiver 17.00.91 (Optimize Strong 1GB): 3,004,242,466 bytes
    • PowerArchiver 17.00.90 (Optimize Strong): 3,398,179,937 bytes

  • conexware

    @nikkho said in Poor compression support:

    @nikkho said in Poor compression support:

    PowerArchiver 17.00.90 (Optimize Strong): 3,398,179,937 bytes

    I used a 1GB dictionary on PA, and the result was reduced to 2.79GB:

    • PowerArchiver 17.00.91 (Optimize Strong 1GB): 3,004,242,466 bytes
    • PowerArchiver 17.00.90 (Optimize Strong): 3,398,179,937 bytes

    I tried both 7zip - Ultra and PA Strong at 128m, and 7z was 4.59GB while PA was 3.17GB.

    This large difference is likely due to rep working on similar files. But rep has a limit of 2GB, so it likely misses a lot when it comes to 20GB samples. But it sure is nice to work at mt8 and have it done in 3x less time 🙂



  • Well, our target here is at least 2.48 GB, which is what DiskZIP is able to achieve with an out-of-the-box 7-Zip compression engine under the hood.

    I was hoping for PA to reduce that further to the neighborhood of 2 GB even, or at least a symbolic reduction over the “raw” upload size.

    It is great to see my own product outperforming all else, but in the interest of advancing the state-of-the-art in compression, I would hope for more third party competition 🙂


  • conexware

    It should be possible to reach a better result with PA.

    1. plzma should support a larger window than 1536M
    2. a1/lzmarec mode might provide a few % better compression than a0/lzma
    3. rep1 dedup filter has parameters that can be tweaked too.
      Or it might be better to disable it instead, when p/lzma with huge window is used.
    4. reflate might work on some files.
    5. x64flt3 exe filter should be better than bcj2
    6. deltb filter should have some effect on exes too
    7. we can tweak file ordering
      Atm we don’t have a PC with >20GB of memory around, so we can’t do these experiments.
      And anyway, I’d not expect that much gain here, because we don’t have LZX recompression atm,
      which is what is necessary for many of these cab/msi files.
      As to .7z files, I guess I can integrate my existing lzma recompressor easily enough, but it won’t have that much effect.


  • @eugene said in Poor compression support:

    It should be possible to reach a better result with PA.

    1. plzma should support a larger window than 1536M
    2. a1/lzmarec mode might provide a few % better compression than a0/lzma
    3. rep1 dedup filter has parameters that can be tweaked too.
      Or it might be better to disable it instead, when p/lzma with huge window is used.
    4. reflate might work on some files.
    5. x64flt3 exe filter should be better than bcj2
    6. deltb filter should have some effect on exes too
    7. we can tweak file ordering
      Atm we don’t have a PC with >20GB of memory around, so we can’t do these experiments.
      And anyway, I’d not expect that much gain here, because we don’t have LZX recompression atm,
      which is what is necessary for many of these cab/msi files.
      As to .7z files, I guess I can integrate my existing lzma recompressor easily enough, but it won’t have that much effect.

    Interesting thoughts. I myself have lost access to the 32 GB RAM machine for the next 10 days or so, but I will be glad to retest as soon as I have that access again. In the meanwhile, I have a 16 GB RAM machine which I will try to retest on.

    Some thoughts:

    1. I tried with 2 GB per the instructions.
    2. How to configure these?
    3. I was counting on dedup for huge savings. Would it conflict with LZMA or would it be best to enable it?
    4. I don’t think there’s many ZIP streams in the dataset.
    5. That sounds very exciting. Is it a custom PA filter? Is it for 64-bit binaries only, or does it also cover 32-bit binaries?
    6. Same as #5.
    7. This must be tweaked, even DiskZIP cannot compress well unless the file ordering is sorted instead of “random”.

    For LZX recompression, you probably won’t be hampered by digital signatures (for when you end up having it), right?

    On that note - some of the LZX’s may have Microsoft’s delta repacks, which may be more problematic than just ordinary LZX decompression.

    The one good news for the .7z files is that they are all stored uncompressed/raw - so the only benefit lost is proper file sorting across the bigger data set.


  • conexware

    @diskzip do you plan to add full MT support for 7z? I think that is a must have if you want people to use your tool over 7z. Otherwise, it is much easier to test 7z with just using 7zFM since we can use 8t cpus properly and it cuts down testing on 20gb files by significant margin (35m vs 140m for this test on my computer).

    Or does DiskZip do anything else for 7zip that affects compression, are results different between 7z and diskzip using 7z?


  • conexware

    In the meanwhile, I have a 16 GB RAM machine which I will try to retest on.

    You should be able to use 1G dictionary there, at least.

    I tried with 2 GB per the instructions.

    2GB is wrong, I suggested 2000M; 2GB is 2048M.
    Problem is, bt4 matchfinder uses 32-bit indexes and there’s a dual buffer for window
    (to avoid special handling for wrap-around).
    Then, there’re also some special margins, so using precisely 2^32/2 for window size
    is also impossible.
    I’m not sure about the precise maximum for window size, so you can start with 1536M
    and try increasing it, I guess.
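    A minimal numeric sketch of the limit described above; the MARGIN value is an illustrative placeholder, since the exact reserved slack is internal to the match finder.

```python
# Why a full 2048M window is out of reach with 32-bit match-finder indexes,
# per the explanation above. MARGIN is an assumed placeholder value.

MB = 1 << 20
INDEX_SPACE = 1 << 32      # bytes addressable with a 32-bit index
DUAL_BUFFER = 2            # window kept twice to avoid wrap-around handling
MARGIN = 16 * MB           # assumed slack for matcher bookkeeping

theoretical_cap = INDEX_SPACE // DUAL_BUFFER    # exactly 2048 MB
usable_window = theoretical_cap - MARGIN        # strictly below 2048 MB

print(f"theoretical cap: {theoretical_cap // MB} MB")
print(f"usable window:  <{usable_window // MB} MB")
```

    This is why 2000M works as a setting while a literal 2GB (2048M) cannot.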

    How to configure these?

    Try looking around in all options windows/tabs?
    Otherwise, just try testing with 7zdll/7zcmd, you should have the links?

    I was counting on dedup for huge savings.
    Would it conflict with LZMA or would it be best to enable it?

    Current dedup filter (rep1) only supports up to 2000M window too,
    due to the same issues as lzma, so it would only hurt lzma compression,
    when it has the same window.

    Compression-wise, srep should be better atm, but it's also slower
    and relies on temp files too much.
    And in any case, you should understand that we can't just use Bulat's
    tools in a commercial app.

    In fact, dedup filter improvement is planned, I’m just busy
    with other codecs atm.

    I don’t think there’s many ZIP streams in the dataset.

    There’re plenty of cab archives with MSZip compression though.
    Like all .msu files, for example.

    5. That sounds very exciting. Is it a custom PA filter?
      Is it for 64-bit binaries only, or does it also cover 32-bit binaries?

    It does more or less the same as bcj2 for 32-bit binaries (hopefully better),
    and it also adds support for RIP addressing in x64 binaries.

    6. Same as #5.

    Yes. Normal delta filter in 7z simply subtracts all bytes with a given step.
    (For example, it would be delta:4 for 16-bit stereo wavs.)

    Deltb, by contrast, is an adaptive delta filter which tries to detect binary tables
    in the data. It's not very good for multimedia, but can be quite helpful for exes.
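    A minimal sketch of such a plain step-N delta filter (illustrative, not 7-Zip's actual code): each byte has the byte `step` positions earlier subtracted from it, modulo 256.

```python
# Plain step-N delta filter as described above: byte i becomes
# (byte[i] - byte[i - step]) mod 256; the first `step` bytes pass through.

def delta_encode(data: bytes, step: int) -> bytes:
    return data[:step] + bytes(
        (data[i] - data[i - step]) & 0xFF for i in range(step, len(data))
    )

def delta_decode(data: bytes, step: int) -> bytes:
    out = bytearray(data)
    for i in range(step, len(out)):   # sequential: each step uses decoded bytes
        out[i] = (out[i] + out[i - step]) & 0xFF
    return bytes(out)

# A slowly rising 4-byte-frame pattern (like 16-bit stereo samples) turns
# into mostly small, repetitive values, which the entropy coder likes:
samples = bytes((i // 4) & 0xFF for i in range(32))
encoded = delta_encode(samples, 4)
assert delta_decode(encoded, 4) == samples
```

    With step 4, a 16-bit stereo stream differences each channel's sample bytes against the previous frame's.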

    For LZX recompression, you probably won’t be hampered by digital signatures
    (for when you end up having it), right?

    All our recompression is lossless, so hashes/crcs/signatures should still match
    on decoding, because extracted archive should be exactly the same.

    There’s a bigger problem with LZX though - it supports window size up to 2M,
    and does optimal parsing, so a reflate equivalent for LZX might appear too slow,
    or would generate too much recovery data (if optimal parsing is not reproduced in recompressor).

    But at least for LZX it might still be possible, while for LZMA it likely isn’t.

    On that note - some of the LZX’s may have Microsoft’s delta repacks, which
    may be more problematic than just ordinary LZX decompression.

    Yes, there’s also LZMS, which is a newer LZX upgrade with support for >2M windows,
    x64 code preprocessing, etc.
    And then, MS also uses quite a few other compression algorithms (xpress,quantum,LZSS,…).
    But it's a lot of work to write a recompressor even for a single format,
    so we don’t have any plans for these atm.

    It's much more interesting to look into direct applications of what we already have first,
    like reflate-based recompression for png/zip/pdf, adding level/winsize detector to reflate, etc.

    The one good news for the .7z files is that they are all stored
    uncompressed/raw - so the only benefit lost is proper file sorting across
    the bigger data set.

    Yes, it could be a good idea to write recompressors for popular archive formats,
    even without support for their codecs - just turn archive into a folder
    and extract whatever data in archive corresponding to files with names from archive.



  • @spwolf said in Poor compression support:

    @diskzip do you plan to add full MT support for 7z? I think that is a must have if you want people to use your tool over 7z. Otherwise, it is much easier to test 7z with just using 7zFM since we can use 8t cpus properly and it cuts down testing on 20gb files by significant margin (35m vs 140m for this test on my computer).

    Or does DiskZip do anything else for 7zip that affects compression, are results different between 7z and diskzip using 7z?

    DiskZIP is fully multi-threaded, but the default compression profiles all favor smaller archive size over processing speed, so you would need to edit your compression settings in the DiskZIP GUI to spread usage over more cores. I am escalating this request internally to see where the magic happens here.

    Note that with standard 7-Zip (or DiskZIP that consumes standard 7-Zip from a structured DLL interface), you need to limit thread counts to two for obtaining the best results. While LZMA2 has been optimized to spread the workload across multiple threads, doing so always does very substantial harm to the compression savings realized.

    DiskZIP does not do anything that affects compression, so results should be 100% identical between 7-Zip and DiskZIP.


  • conexware

    @diskzip said in Poor compression support:

    @spwolf said in Poor compression support:

    @diskzip do you plan to add full MT support for 7z? I think that is a must have if you want people to use your tool over 7z. Otherwise, it is much easier to test 7z with just using 7zFM since we can use 8t cpus properly and it cuts down testing on 20gb files by significant margin (35m vs 140m for this test on my computer).

    Or does DiskZip do anything else for 7zip that affects compression, are results different between 7z and diskzip using 7z?

    DiskZIP is fully multi-threaded, but the default compression profiles all favor smaller archive size over processing speed, so you would need to edit your compression settings in the DiskZIP GUI to spread usage over more cores. I am escalating this request internally to see where the magic happens here.

    Note that with standard 7-Zip (or DiskZIP that consumes standard 7-Zip from a structured DLL interface), you need to limit thread counts to two for obtaining the best results. While LZMA2 has been optimized to spread the workload across multiple threads, doing so always does very substantial harm to the compression savings realized.

    DiskZIP does not do anything that affects compression, so results should be 100% identical between 7-Zip and DiskZIP.

    I could not find anything; even for smaller sets needing less memory, it only uses around 20% of my CPU (8t CPU), while with the same files and settings 7z would use up to 100%.

    With only the dictionary changed to d128M, I get a 60MB smaller file by using 7-Zip vs using DiskZip. Something you can try on your end as well - maybe some other setting needs to be changed?

    At that point, PA is smaller by 960M… with a big lzma dictionary, it is basically used as dedup. I tested d768m with 7z and the difference went down to 30-40M.

    It will be interesting to see more results on my test computers once I am back from vacation, in some 15 days. I will be able to test various settings at that point, while right now I can only use my laptop and it takes more than 2hrs.



  • @spwolf said in Poor compression support:

    end

    OK, DiskZIP uses all available CPU cores with a 16 MB dictionary or smaller, and a maximum of 3 CPU cores with a 32 MB dictionary. A 64 MB dictionary or larger results in a core limit of 2.

    Apparently these numbers are heuristic limits from a long time ago. Do you think we should move up the dictionary limits somewhat?
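    The heuristic above can be sketched as follows; the function name and shape are hypothetical, and only the 16/32/64 MB data points come from this post - behavior between those points is an assumption.

```python
# Hypothetical sketch of the dictionary-size-based core cap described above.
# Thresholds: <=16 MB -> all cores, <=32 MB -> max 3, >=64 MB -> max 2.

def core_cap(dict_mb: int, available_cores: int) -> int:
    if dict_mb <= 16:
        return available_cores              # small dictionary: use every core
    if dict_mb <= 32:
        return min(available_cores, 3)
    return min(available_cores, 2)          # large dictionary: protect the ratio

for d in (16, 32, 64, 1536):
    print(f"{d:>4} MB dictionary -> {core_cap(d, 8)} cores")
```

    The trade-off encoded here is the same one discussed throughout the thread: more threads mean a split dictionary and a worse ratio.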



  • @spwolf said in Poor compression support:

    Memory usage is 11.5x the dictionary size for each 2-thread pair in the mt setting for lzma2.

    LZMA decoder is simple. But PPMd decoder is complex. LZMA2 is better than LZMA. LZMA2 compression does not replace (supersede) LZMA compression; LZMA2 is merely an additional “wrapper” around LZMA. With LZMA2, data is split into blocks, but each block is still compressed by “normal” LZMA. Because individual blocks are compressed separately, processing the blocks can be parallelized, which allows for multi-threading. LZMA2 also allows “uncompressed” blocks, to better deal with “already compressed” inputs.



  • @winstongel said in Poor compression support:

    @spwolf said in Poor compression support:

    Memory usage is 11.5x the dictionary size for each 2-thread pair in the mt setting for lzma2.

    LZMA decoder is simple. But PPMd decoder is complex. LZMA2 is better than LZMA. LZMA2 compression does not replace (supersede) LZMA compression; LZMA2 is merely an additional “wrapper” around LZMA. With LZMA2, data is split into blocks, but each block is still compressed by “normal” LZMA. Because individual blocks are compressed separately, processing the blocks can be parallelized, which allows for multi-threading. LZMA2 also allows “uncompressed” blocks, to better deal with “already compressed” inputs.

    That's a dramatic oversimplification. LZMA2 very substantially hurts compression ratios when more than two threads are used (which is why DiskZIP limits itself to two threads in virtually all compression scenarios with a non-minuscule [at least by today's standards] dictionary size).
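    The ratio cost of independent blocks is easy to demonstrate with Python's stdlib lzma on synthetic repetitive data (a stand-in for MT LZMA2's blockwise splitting, not the real codec path):

```python
import lzma

# Independently compressed chunks cannot share a dictionary, so matches that
# span chunk boundaries are lost - the cost behind the MT-vs-ratio trade-off.

data = b"some repetitive application payload " * 4096   # ~144 KB, highly redundant

whole = len(lzma.compress(data))

chunk = len(data) // 4
split = sum(
    len(lzma.compress(data[i:i + chunk])) for i in range(0, len(data), chunk)
)

print(f"single stream: {whole} bytes")
print(f"4 chunks:      {split} bytes")   # larger: per-chunk headers + lost matches
assert split > whole
```

    Real LZMA2 MT uses much bigger blocks, so the effect is smaller in practice, but the direction is the same.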



  • @diskzip said in Poor compression support:

    @spwolf said in Poor compression support:

    end

    OK, DiskZIP uses all available CPU cores with a 16 MB dictionary or smaller, and a maximum of 3 CPU cores with a 32 MB dictionary. A 64 MB dictionary or larger results in a core limit of 2.

    Apparently these numbers are heuristic limits from a long time ago. Do you think we should move up the dictionary limits somewhat?

    We’re pushing out an update soon, which adds a new “All” parameter to the multi-threading/hyper-threading setting.

    This new “All” parameter will be the default in the regular and high compression profiles. Only the extreme compression profiles will stick to the previous “Yes” setting.

    When “All” is selected here, all cores will be used. When “Yes” is selected here, the previous heuristics will apply (2 cores in most scenarios as above).


  • Alpha Testers

    @diskzip said in Poor compression support:

    @diskzip said in Poor compression support:

    @spwolf said in Poor compression support:

    end

    OK, DiskZIP uses all available CPU cores with a 16 MB dictionary or smaller, and a maximum of 3 CPU cores with a 32 MB dictionary. A 64 MB dictionary or larger results in a core limit of 2.

    Apparently these numbers are heuristic limits from a long time ago. Do you think we should move up the dictionary limits somewhat?

    We’re pushing out an update soon, which adds a new “All” parameter to the multi-threading/hyper-threading setting.

    This new “All” parameter will be the default in the regular and high compression profiles. Only the extreme compression profiles will stick to the previous “Yes” setting.

    When “All” is selected here, all cores will be used. When “Yes” is selected here, the previous heuristics will apply (2 cores in most scenarios as above).

    It is a very nice feature. But I am not sure if All should be equivalent to all cores, or to all threads in case of Hyper-Threading and similar.



  • @nikkho Oh, of course All is equal to all available logical cores - including hyperthreaded cores and physical cores.


  • conexware

    @diskzip cool… now, did you try creating that archive with DiskZip vs 7z.exe? Now that vacations are over, we got some new computers with 32GB and 64GB of RAM, so we can test this particular case properly. I tested this 2 weeks ago, but we are working on some new codecs so I forgot to post results.

    Turns out the main reason is likely that 7z.exe does better filter selection than PA in this particular case. Also, when you use 7z.dll like DiskZip does (and PA), you don't get the file detection that 7z.exe has. DiskZip just forces bcj2 on everything, while we try to detect and apply appropriate filters based on extension, but either we miss some extension or the itanium filter that 7z.exe uses is the reason for this.

    So my best case result was:
    2.47G for 7z.exe
    2.52G for PA

    • DiskZip and PowerArchiver .7z files were around 100M worse than .PA format.

    Now lets try some multimedia files (wav files from 10GB Matt Mahoney test sample)

    • PA 277M
    • 7z 282M
    • DiskZip 322M

    Same dictionary setting (1536M). 7z and PA use the same delta filters; the difference is just the lzmarec entropy coder in PA vs lzma. We do not have a special WAV codec yet, so basically everything is very similar to 7z when it comes to WAV files.

    On the other hand, DiskZip does not use multimedia filters at all and also applies BCJ2 on everything (this might have an adverse effect on speed and size). So the next job would be to detect multimedia files better and apply the delta filter on those. Try wav, bmp (and other similar uncompressed multimedia files).



  • @spwolf Incorrect - DiskZIP does offer the same level of file detection that 7z.exe has.

    You do have to manually select a custom compression string though:
    DiskZIP 7-Zip Compression settings

    This is because the default profiles do indeed indiscriminately apply the BCJ2 filter as you have already noticed.

    With the custom compression strings, DiskZIP would be about 100 MB better than PA.

    Note that I have never even been able to come close to your reported results with PA at all - something you’ll probably be able to fix in future releases.

    DiskZIP’s results with multimedia would also improve on par with 7-Zip with the custom compression string, which enables the same auto-switching filter capability found in 7z.exe.

    Have you also had a chance to run DiskZIP’s ZIPX compressor on multimedia files? That might outperform both 7-Zip and PA.

    DiskZIP ZIPX Plug-In Configuration

    Please try it out and let me know what you find!

    Does PA include JPEG compression at all? This is again provided by the ZIPX Plug-In in DiskZIP.


  • conexware

    @diskzip using forced settings on 7z.dll doesn't work well; it will apply them to all files and worsen compression for everything, unless you select settings for each file separately. So it is not a solution except for a hand-picked sample.

    So what setting did you use on the archive you sent? I am pretty sure it was created by 7z.exe and not DiskZip. You can see it by the type of filters used.

    As to zipx, it has significantly worse compression for anything but jpeg and wav (not to mention mt1 or mt2 in most cases), so again users would have to compress files separately and hand-pick which go into zipx and which go into 7z.

    With PA, we added extension sort in the latest version and you can use a 2G window, so the result is 2645; it is easily repeatable…

    We filter all these files automatically and apply over 15 different codecs and filters, so generally it will produce better results in most real-life cases than using a different format for each file. Realistically you can't compare that to using 7z.dll with bcj2 on all files. For people using the 7z format, the best bet is to use 7z.exe and not 3rd-party tools, due to the file detection that 7z.exe uses; your result will never be as good.



  • @spwolf said in Poor compression support:

    @diskzip using forced settings on 7z.dll doesn't work well; it will apply them to all files and worsen compression for everything, unless you select settings for each file separately. So it is not a solution except for a hand-picked sample.

    So what setting did you use on the archive you sent? I am pretty sure it was created by 7z.exe and not DiskZip. You can see it by the type of filters used.

    As to zipx, it has significantly worse compression for anything but jpeg and wav (not to mention mt1 or mt2 in most cases), so again users would have to compress files separately and hand-pick which go into zipx and which go into 7z.

    With PA, we added extension sort in the latest version and you can use a 2G window, so the result is 2645; it is easily repeatable…

    We filter all these files automatically and apply over 15 different codecs and filters, so generally it will produce better results in most real-life cases than using a different format for each file. Realistically you can't compare that to using 7z.dll with bcj2 on all files. For people using the 7z format, the best bet is to use 7z.exe and not 3rd-party tools, due to the file detection that 7z.exe uses; your result will never be as good.

    No.

    First, as I’ve mentioned before, DiskZIP is not based on 7z.exe or 7z.dll. We have our own proprietary wrapper around the 7-Zip code, probably similar to how PA does it, with the major difference that we do not actually change any compression code. The advantage of this approach is that we enjoy all the features inside 7-Zip as a “first-class citizen” of the 7-Zip stack.

    This includes, again as I have mentioned before, proper file sorting, automatic filter application, among others.

    This is probably why you’re adamant the file I sent you must have been created using 7z.exe (or 7z.dll) instead of DiskZIP. If you still have any doubts, please pick the first command line override parameter (which configures for the highest compression) and reproduce the results for yourself. This will of course require a whole lot of memory, which is why we don’t include this setting by default in any of our out-of-the-box compression profiles.

    As for your specific points:

    Extension sort in DiskZIP is standard in all compression profiles, but can be turned off if the user so desires (just set File Sorting to false). Sorting was actually a surprising regression in 7-Zip compression performance between the latest 16.x-17.x branches and the earlier 9.2x branches. Fortunately, we never passed this regression on to our customers.

    Correct filter application is, again, automatic with the overridden command line. Of course, if you use the GUI settings, the default profiles do lock you in to BCJ2, but you can change them to whatever you want (and it would hardly be unexpected to get the filter you configured at that point).

    It would still have been nice had you been able to test our ZIPX format. Maybe the results would have surprised you?

    Last but not least, the reason we chose the BCJ2 filter as the default in our profiles is that it performed best in our tests across all the samples we came across, not just cherry-picked test cases. Of course, it is understandable that our median test cases and your median test cases would differ.

    I hope that helps clear up some of the confusion. At the end of the day, is PA still unable to meet or exceed DiskZIP’s compression on our sample data set? It is, again, a typical load for our environment, and not a hand-picked, specially crafted set designed just to defeat PA (that would be quite hard to build manually, I’m sure!).


  • conexware

    @diskzip again, the file that you provided seems to have been created with 7z.exe. A file created by DiskZIP does not have these filters applied and is 150 MB larger. There is no special switch in 7-Zip to make it use the Itanium filter, for instance; it is part of BCJ2 and used when 7z.exe detects an appropriate executable. When you use 7z.exe to compress your folder, it will automatically apply the Itanium filter when it detects such an executable.

    Same goes for the use of the multimedia delta filter. DiskZIP does not use delta filters at all, as it needs 7z.exe to determine which files to apply them to. As such, DiskZIP will always create different/worse results compared to 7-Zip.
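For readers unfamiliar with the delta filter mentioned here: it replaces each byte with its difference from the byte a fixed distance back, turning slowly varying multimedia samples into small, highly compressible values. A minimal sketch of the general technique (illustrative only, not 7-Zip’s actual implementation; the function names are made up):

```python
def delta_encode(data: bytes, dist: int = 1) -> bytes:
    # Store the difference between each byte and the byte `dist` positions
    # earlier; slowly varying data (e.g. PCM audio) becomes mostly near-zero.
    out = bytearray(data)
    for i in range(len(data) - 1, dist - 1, -1):
        out[i] = (data[i] - data[i - dist]) & 0xFF
    return bytes(out)

def delta_decode(data: bytes, dist: int = 1) -> bytes:
    # Inverse transform: a running sum restores the original bytes.
    out = bytearray(data)
    for i in range(dist, len(out)):
        out[i] = (out[i] + out[i - dist]) & 0xFF
    return bytes(out)

ramp = bytes(range(100, 130))                 # slowly increasing samples
assert delta_decode(delta_encode(ramp)) == ramp
print(list(delta_encode(ramp)[:5]))           # [100, 1, 1, 1, 1]
```

The encoded stream is dominated by tiny values, which the entropy coder then squeezes far better than the raw samples; picking the right files for this transform is exactly the auto-detection being debated above.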

    Please let me know which switch to use in order to replicate the file you sent as an example of DiskZIP compression.

    As to “your” ZIPX compression, it is actually WinZip’s ZIPX compression. Are you aware that ZIPX is not solid compression and that it creates significantly larger files than 7z? When it comes to special codecs, the PA format has better codecs for text, JPEG, and MP3 compared to WinZip’s ZIPX (specifically what is unique compared to WinRAR and 7-Zip), together with everything else that makes the PA format unique. It is considerably faster while having considerably better compression at the same time.

    So yeah, we try WinZip’s ZIPX every time we add a new codec to the PA format and benchmark against it, as well as against 7-Zip, WinRAR, and FreeArc, plus specialty compressors like Precomp. Obviously we try to do something better, even if we do not succeed every time.

    And since you obviously consider DiskZIP to be on par with other compression applications, why not submit it for testing at:
    http://www.squeezechart.com/

    That way you will have independent tests to back up those first-class-citizen claims. The problem is that in the tests I do with DiskZIP, it creates larger files than 7-Zip, but I would love to be proven wrong. I am sure @Stephan will be helpful as usual.
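The solid-compression point made above is easy to demonstrate: compressing similar files as one stream lets the compressor reuse matches across file boundaries, which per-file formats cannot do. A rough illustration using zlib’s DEFLATE (chosen only because it ships with Python; this is neither PA’s nor WinZip’s actual codec):

```python
import zlib

# Four files sharing most of their content, as in a typical install set.
files = [b"shared runtime bytes " * 200 + bytes([65 + i]) * 32 for i in range(4)]

per_file = sum(len(zlib.compress(f, 9)) for f in files)  # zip/zipx style
solid = len(zlib.compress(b"".join(files), 9))           # 7z solid style

print(per_file > solid)  # cross-file redundancy only helps in the solid case
```

With real multi-gigabyte samples the size of the gap depends on the dictionary, which is why the 2 GB window keeps coming up in this thread.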



    Thank you for sticking around; we’ll get to the bottom of this.

    Please see the step-by-step instructions I’ve posted here for configuring the compression string manually:

    https://encode.ru/threads/130-PowerArchiver?p=53664&viewfull=1#post53664

    It is easier to attach images to a post there, so please excuse me for redirecting you there for the instructions. They are precise, though, so the detour is worth it.

    For ZIPX, we use the BricolSoft Zip ActiveX control; this excellent library may be worth checking out for you folks as well. After all, nobody on this thread invented 7-Zip, and nobody here invented ZIPX either!


  • conexware

    @diskzip said in Poor compression support:

    https://encode.ru/threads/130-PowerArchiver?p=53664&viewfull=1#post53664

    Cool, will test it out with that setting sometime soon.



  • We have just released DiskZIP 2018, with the following enhancements:

    https://encode.ru/threads/2763-DiskZIP

    Thank you very much for the feedback you provided on this thread regarding the usage of CPU cores. This issue is now addressed by way of utilizing all available CPU cores (with the exception of the highest compression profiles, which limit to two CPU cores in order to ensure the best compression ratios).

    Please also note, you would now need to choose the second command line override string instead of the first, because in DiskZIP 2018, the first command line override string again forces BCJ2 compression 🙂

    I know, ironic! The customer is always right, and that’s what they wanted…

    So please excuse the inconvenience caused by the inaccurate instructions - as you will see for yourself when you take a look at the custom compression strings though, the second one does engage 7-Zip’s automatic file detection filters.

    I would like to thank you all once again, on behalf of the entire DiskZIP team, for your feedback - and we look forward to any further feedback you may have on the 2018 release. Wishing you all a great weekend!


  • conexware

    @diskzip now we get press releases on our forums; do you have a link to it so I can delete the huge post that is relevant mostly to your product?



  • @spwolf You are right, please excuse my enthusiasm. I have made the edit directly, please don’t hesitate to let me know if it is not satisfactory.



    @spwolf Have you had a chance to test our native compression yet? Please note, as I described in my new release announcement post, to use an auto-switching filter, you now have to use the second command line override parameter (instead of the first, as it was in the prior version). Thanks!


  • conexware

    @diskzip what’s native compression, 7z?

    Did not have a chance to try the new version yet; I might this week.

    As to the PA format, our current record with the simple/optimized setting is:
    PA 17.01.03: 2,460,182 kB (d2G, mt1)
    7zip max settings: 2,593,514 kB
    Your sample: 2,600,325 kB

    We adjusted extension groupings, added more extensions to the exe/dll group that our exe filters are applied to, and added unlimited-size blocks to get better compression on samples over 10 GB.

    I think we can do at least 20-30 MB better with the current settings; not everything is perfectly grouped together for this sample yet.

    So currently we are around 130 MB better than the best 7z settings, and 136 MB better than dzip.



  • @spwolf said in Poor compression support:

    @diskzip what’s native compression, 7z?

    Did not have a chance to try the new version yet; I might this week.

    As to the PA format, our current record with the simple/optimized setting is:
    PA 17.01.03: 2,460,182 kB (d2G, mt1)
    7zip max settings: 2,593,514 kB
    Your sample: 2,600,325 kB

    We adjusted extension groupings, added more extensions to the exe/dll group that our exe filters are applied to, and added unlimited-size blocks to get better compression on samples over 10 GB.

    I think we can do at least 20-30 MB better with the current settings; not everything is perfectly grouped together for this sample yet.

    So currently we are around 130 MB better than the best 7z settings, and 136 MB better than dzip.

    By native compression, I was asking whether you were able to reproduce our results on our test data set using DiskZIP and the updated compression string for the 7ZIP file type, natively inside DiskZIP, without using 7z.exe, 7z.dll, or any other external process.

    So can you share a compression string, or other instructions for configuring PA, so that I can reproduce your results internally?

    Also, glad to see you were able to push 7-Zip 6 MB more than the built-in DiskZIP. What compression string did you use to achieve this - or what version of 7-Zip?

    We should be able to replicate the exact same result in DiskZIP, of course; it’s just a matter of discovering the compression string.

    Thanks again, and congratulations!


  • conexware

    @diskzip yes, I was… when it comes to 7z, I just maximized the settings in the 7z GUI and added the same switches you use for sorting, and that gave me a 6-7 MB smaller file. Try it out and see what the differences are compared to your internal switches.

    There is no overly complicated string or anything like that in PA; just select 2000 for the dictionary and mt1, extreme setting of course. Same as before.

    The update should be out today or tomorrow, depending on QA team testing; quite a number of things have changed in other parts of PA. The update from last month already got the same settings down to 2,599… There is no change in codecs.

    We should get a few MB more at least in the end. This was useful for us as a test: it prompted us to add more exe/dll formats to the list, as well as the sorting/big-block switch for files over 10 GB total.



  • Of course. Glad I could help 🙂


  • Alpha Testers

    New record with Razor: 2,413,444 kB
    https://encode.ru/threads/130-PowerArchiver?p=54599&viewfull=1#post54599



  • @nikkho said in Poor compression of >20GB exe/msi/cab sample:

    New record with Razor: 2,413,444 kB
    https://encode.ru/threads/130-PowerArchiver?p=54599&viewfull=1#post54599

    Yes, while we have been bickering here about DLLs and EXEs vs. native implementations, someone has actually improved something 🙂

