Poor compression of >20GB exe/msi/cab sample
-
@spwolf said in Poor compression support:
@diskzip do you plan to add full MT support for 7z? I think that is a must-have if you want people to use your tool over 7z. Otherwise, it is much easier to test 7z by just using 7zFM, since we can use 8-thread CPUs properly and it cuts testing time on 20 GB files by a significant margin (35 min vs. 140 min for this test on my computer).
Or does DiskZIP do anything else for 7-Zip that affects compression? Are results different between 7z and DiskZIP using 7z?
DiskZIP is fully multi-threaded, but the default compression profiles all favor smaller archive size over processing speed, so you would need to edit your compression settings in the DiskZIP GUI to spread usage over more cores. I am escalating this request internally to see where the magic happens here.
Note that with standard 7-Zip (or DiskZIP that consumes standard 7-Zip from a structured DLL interface), you need to limit thread counts to two for obtaining the best results. While LZMA2 has been optimized to spread the workload across multiple threads, doing so always does very substantial harm to the compression savings realized.
DiskZIP does not do anything that affects compression, so results should be 100% identical between 7-Zip and DiskZIP.
-
@diskzip said in Poor compression support:
@spwolf said in Poor compression support:
@diskzip do you plan to add full MT support for 7z? I think that is a must-have if you want people to use your tool over 7z. Otherwise, it is much easier to test 7z by just using 7zFM, since we can use 8-thread CPUs properly and it cuts testing time on 20 GB files by a significant margin (35 min vs. 140 min for this test on my computer).
Or does DiskZIP do anything else for 7-Zip that affects compression? Are results different between 7z and DiskZIP using 7z?
DiskZIP is fully multi-threaded, but the default compression profiles all favor smaller archive size over processing speed, so you would need to edit your compression settings in the DiskZIP GUI to spread usage over more cores. I am escalating this request internally to see where the magic happens here.
Note that with standard 7-Zip (or DiskZIP that consumes standard 7-Zip from a structured DLL interface), you need to limit thread counts to two for obtaining the best results. While LZMA2 has been optimized to spread the workload across multiple threads, doing so always does very substantial harm to the compression savings realized.
DiskZIP does not do anything that affects compression, so results should be 100% identical between 7-Zip and DiskZIP.
I could not find anything. Even for smaller sets that need less memory, it only uses around 20% of my CPU (8 threads), while with the same files and settings 7z would use up to 100%.
With only the dictionary changed to d128M, I get a 60 MB smaller file using 7-Zip than using DiskZIP. That is something you can try on your end as well; maybe some other setting needs to be changed?
At that point, PA is smaller by 960M… with a big LZMA dictionary, the dictionary is basically used as dedup. I tested d768m with 7z and the difference went down to 30-40M.
It will be interesting to see more results on my test computers once I am back from vacation, in some 15 days. I will be able to run it with various settings at that point; right now I can only use a laptop, and it takes more than 2 hours.
-
@spwolf said in Poor compression support:
end
OK, DiskZIP uses all available CPU cores with a 16 MB dictionary or smaller, and a maximum of 3 CPU cores with a 32 MB dictionary. A 64 MB dictionary or larger results in a core limit of 2.
Apparently these numbers are heuristic limits from a long time ago. Do you think we should move up the dictionary limits somewhat?
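The limits described in this post can be written down as a small lookup. This is only a sketch of the heuristic as stated here, not DiskZIP's actual code: the function name is hypothetical, and the post does not say what happens for dictionary sizes between the named points (for example 24 MB or 48 MB), so the boundaries chosen below are an assumption.

```python
import os


def max_threads_for_dictionary(dict_mb, logical_cpus=None):
    """Hypothetical sketch of the thread limits described in the post.

    dict_mb      -- LZMA dictionary size in megabytes
    logical_cpus -- number of logical CPUs (defaults to what the OS reports)
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1

    if dict_mb <= 16:
        # "all available CPU cores with a 16 MB dictionary or smaller"
        return logical_cpus
    if dict_mb < 64:
        # "a maximum of 3 CPU cores with a 32 MB dictionary"
        # Assumption: every size strictly between 16 MB and 64 MB is
        # treated like the 32 MB case.
        return min(3, logical_cpus)
    # "A 64 MB dictionary or larger results in a core limit of 2"
    return min(2, logical_cpus)


if __name__ == "__main__":
    for size in (8, 16, 32, 64, 128):
        print(size, "MB ->", max_threads_for_dictionary(size, logical_cpus=8), "threads")
```

Raising the dictionary limits, as asked above, would amount to moving the `16` and `64` thresholds in this sketch.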
-
@spwolf said in Poor compression support:
Memory usage is 11.5x the dictionary size for each 2 threads in the mt setting for LZMA2.
The LZMA decoder is simple, but the PPMd decoder is complex. LZMA2 is better than LZMA. LZMA2 compression does not replace (supersede) LZMA compression; LZMA2 is merely an additional “wrapper” around LZMA. With LZMA2, data is split into blocks, but each block is still compressed by “normal” LZMA. Because individual blocks are compressed separately, processing the blocks can be parallelized, which allows for multi-threading. LZMA2 also allows “uncompressed” blocks, to better deal with “already compressed” inputs.
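The block-splitting idea can be illustrated with Python's standard `lzma` module. This is a simplified sketch, not how 7-Zip schedules its LZMA2 threads: the block size, the worker count, and the helper names are all assumptions made for illustration. It compresses the same data once as a single stream and once as independently compressed blocks, so the cost of losing matches across block boundaries becomes visible.

```python
import lzma
import random
from concurrent.futures import ThreadPoolExecutor


def lzma2_filters(dict_size):
    # A single LZMA2 filter with an explicit dictionary size.
    return [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]


def compress_whole(data, dict_size=1 << 20):
    # One stream: the dictionary can reach back across the whole input.
    return lzma.compress(data, format=lzma.FORMAT_XZ,
                         filters=lzma2_filters(dict_size))


def compress_in_blocks(data, block_size, workers=4, dict_size=1 << 20):
    # Independent blocks: each one can be compressed on its own thread,
    # but no block can reference data that lives in another block.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: compress_whole(b, dict_size), blocks))


def decompress_blocks(parts):
    return b"".join(lzma.decompress(p) for p in parts)


def make_sample():
    # 64 KiB of pseudo-random bytes repeated 8 times: the only redundancy
    # is between blocks, never inside a single block.
    chunk = random.Random(0).getrandbits(8 * 65536).to_bytes(65536, "little")
    return chunk * 8


if __name__ == "__main__":
    data = make_sample()
    whole = compress_whole(data)
    parts = compress_in_blocks(data, block_size=65536)
    print("single stream:", len(whole), "bytes")
    print("independent blocks:", sum(len(p) for p in parts), "bytes")
```

The sample is deliberately a worst case, since all of its redundancy sits across block boundaries; on real data the gap depends on how much long-range redundancy the input has relative to the block size.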
-
@winstongel said in Poor compression support:
@spwolf said in Poor compression support:
Memory usage is 11.5x the dictionary size for each 2 threads in the mt setting for LZMA2.
The LZMA decoder is simple, but the PPMd decoder is complex. LZMA2 is better than LZMA. LZMA2 compression does not replace (supersede) LZMA compression; LZMA2 is merely an additional “wrapper” around LZMA. With LZMA2, data is split into blocks, but each block is still compressed by “normal” LZMA. Because individual blocks are compressed separately, processing the blocks can be parallelized, which allows for multi-threading. LZMA2 also allows “uncompressed” blocks, to better deal with “already compressed” inputs.
That’s a dramatic oversimplification. LZMA2 very substantially hurts compression ratios when greater than two threads are used (which is why DiskZIP limits to two threads in virtually all compression scenarios with a non-minuscule [at least, by today’s standards] dictionary size).
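To put the memory figure quoted in this exchange into numbers, here is a back-of-the-envelope estimate. It simply applies the "11.5x the dictionary size per 2 threads" rule of thumb from the thread; the function name is hypothetical, rounding odd thread counts up to the next pair is an assumption, and real 7-Zip memory use depends on the version and match finder.

```python
import math


def estimate_lzma2_memory_mb(dict_mb, threads):
    """Rough compression-memory estimate in MB, per the thread's rule of thumb."""
    # Assumption: threads are counted in pairs, and an odd thread count
    # is rounded up to the next pair.
    pairs = math.ceil(threads / 2)
    return 11.5 * dict_mb * pairs


if __name__ == "__main__":
    for dict_mb, threads in [(128, 2), (128, 8), (1536, 2)]:
        mb = estimate_lzma2_memory_mb(dict_mb, threads)
        print(f"d{dict_mb}M, mt{threads}: about {mb / 1024:.1f} GB")
```

Under this rule, a 128 MB dictionary needs roughly 1.4 GB with 2 threads and roughly 5.8 GB with 8 threads, which shows how quickly memory grows when large dictionaries are combined with many threads.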
-
@diskzip said in Poor compression support:
@spwolf said in Poor compression support:
end
OK, DiskZIP uses all available CPU cores with a 16 MB dictionary or smaller, and a maximum of 3 CPU cores with a 32 MB dictionary. A 64 MB dictionary or larger results in a core limit of 2.
Apparently these numbers are heuristic limits from a long time ago. Do you think we should move up the dictionary limits somewhat?
We’re pushing out an update soon, which adds a new “All” parameter to the multi-threading/hyper-threading setting.
This new “All” parameter will be the default in the regular and high compression profiles. Only the extreme compression profiles will stick to the previous “Yes” setting.
When “All” is selected here, all cores will be used. When “Yes” is selected here, the previous heuristics will apply (2 cores in most scenarios as above).
-
@diskzip said in Poor compression support:
@diskzip said in Poor compression support:
@spwolf said in Poor compression support:
end
OK, DiskZIP uses all available CPU cores with a 16 MB dictionary or smaller, and a maximum of 3 CPU cores with a 32 MB dictionary. A 64 MB dictionary or larger results in a core limit of 2.
Apparently these numbers are heuristic limits from a long time ago. Do you think we should move up the dictionary limits somewhat?
We’re pushing out an update soon, which adds a new “All” parameter to the multi-threading/hyper-threading setting.
This new “All” parameter will be the default in the regular and high compression profiles. Only the extreme compression profiles will stick to the previous “Yes” setting.
When “All” is selected here, all cores will be used. When “Yes” is selected here, the previous heuristics will apply (2 cores in most scenarios as above).
It is a very nice feature. But I am not sure if All should be equivalent to all cores, or to all threads in case of Hyper-Threading and similar.
-
@nikkho Oh, of course All is equal to all available logical cores - including hyperthreaded cores and physical cores.
-
@diskzip cool… now, did you try creating that archive with DiskZIP vs 7z.exe? Now that vacations are over, we got some new computers with 32 GB and 64 GB of RAM, so we can test this particular case properly. I tested this 2 weeks ago, but we are working on some new codecs, so I forgot to post the results.
Turns out the main reason is likely that 7z.exe does better filter selection than PA in this particular case. Also, when you use 7z.dll like DiskZIP does (and PA), you don’t get the file detection that 7z.exe has. DiskZIP just forces BCJ2 on everything, while we try to detect and apply appropriate filters based on extension; either we miss some extensions, or the Itanium filter that 7z.exe uses is the reason for this.
So my best-case results were:
- 2.47G for 7z.exe
- 2.52G for PA
- DiskZIP and PowerArchiver .7z files were around 100M worse than the .PA format.
Now let’s try some multimedia files (WAV files from the 10GB Matt Mahoney test sample):
- PA 277M
- 7z 282M
- DiskZip 322M
Same dictionary setting (1536M). 7z and PA use the same delta filters; the difference is just the lzmarec entropy coder in PA vs. lzma. We do not have a special WAV codec yet, so basically everything is very similar to 7z when it comes to WAV files.
On the other hand, DiskZIP does not use multimedia filters at all and also applies BCJ2 to everything (this might have an adverse effect on speed and size). So the next job would be to detect multimedia files better and apply the delta filter to those. Try WAV, BMP (and other similar uncompressed multimedia files).
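As a rough illustration of extension-based filter selection and of why a delta filter helps on uncompressed audio, here is a sketch using Python's standard `lzma` module. The extension groups and function names are hypothetical and far shorter than what any real archiver uses; the filter labels are only strings (Python's `lzma` has no BCJ2 filter), and the sample is a synthetic 16-bit ramp, not data from the test set discussed here.

```python
import lzma
import os

# Hypothetical extension groups; real tools use their own, longer lists.
EXECUTABLE_EXTS = {".exe", ".dll"}
RAW_MEDIA_EXTS = {".wav", ".bmp"}


def pick_filter(filename):
    """Return a filter label for a file, based only on its extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXECUTABLE_EXTS:
        return "bcj2"
    if ext in RAW_MEDIA_EXTS:
        return "delta"
    return "none"


def compress(data, use_delta, sample_bytes=2):
    """Compress with LZMA2, optionally preceded by a delta filter."""
    chain = []
    if use_delta:
        # dist = bytes per sample (2 for 16-bit mono audio)
        chain.append({"id": lzma.FILTER_DELTA, "dist": sample_bytes})
    chain.append({"id": lzma.FILTER_LZMA2, "preset": 6})
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=chain)


def synthetic_pcm(samples=50000, step=3):
    """A slowly rising 16-bit little-endian ramp standing in for PCM audio."""
    return b"".join(((i * step) % 65536).to_bytes(2, "little")
                    for i in range(samples))


if __name__ == "__main__":
    pcm = synthetic_pcm()
    print("filter for song.wav:", pick_filter("song.wav"))
    print("LZMA2 only:   ", len(compress(pcm, use_delta=False)), "bytes")
    print("delta + LZMA2:", len(compress(pcm, use_delta=True)), "bytes")
```

The delta filter turns a smoothly changing signal into a stream of small, repetitive differences, which LZMA2 then handles far better; applying it blindly to every file type would not help in the same way, which is why per-file detection matters.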
-
@spwolf Incorrect - DiskZIP does offer the same level of file detection that 7z.exe has.
You do have to manually select a custom compression string though:
This is because the default profiles do indeed indiscriminately apply the BCJ2 filter as you have already noticed.
With the custom compression strings, DiskZIP would be about 100 MB better than PA.
Note that I have never even been able to come close to your reported results with PA at all - something you’ll probably be able to fix in future releases.
DiskZIP’s results with multimedia would also improve on par with 7-Zip with the custom compression string, which enables the same auto-switching filter capability found in 7z.exe.
Have you also had a chance to run DiskZIP’s ZIPX compressor on multimedia files? That might outperform both 7-Zip and PA.
Please try it out and let me know what you find!
Does PA include JPEG compression at all? This is again provided by the ZIPX Plug-In in DiskZIP.
-
@diskzip using forced settings on 7z.dll doesn’t work well; it will apply them to all files and worsen compression for everything, unless you select settings for each file separately. So it is not a solution except for a hand-picked sample.
So what setting did you use on the archive you sent? I am pretty sure it was created by 7z.exe and not DiskZIP. You can see it by the type of filters used.
As to ZIPX, it has significantly worse compression for anything but JPEG and WAV (not to mention mt1 or mt2 in most cases), so again users would have to compress files separately and hand-pick which go into ZIPX and which go into 7z.
With PA, we added extension sort in the latest version and you can use a 2g window, so the result is 2645; it is easily repeatable…
We filter all these files automatically and apply over 15 different codecs and filters, so generally it will produce better results in most real-life cases than using a different format for each file. Realistically, you can’t compare that to using the 7z DLL with BCJ2 on all files. For people using the 7z format, the best bet is to use 7z.exe and not third-party tools, due to the file detection that 7z.exe uses; your result will never be as good.
-
@spwolf said in Poor compression support:
@diskzip using forced settings on 7z.dll doesn’t work well; it will apply them to all files and worsen compression for everything, unless you select settings for each file separately. So it is not a solution except for a hand-picked sample.
So what setting did you use on the archive you sent? I am pretty sure it was created by 7z.exe and not DiskZIP. You can see it by the type of filters used.
As to ZIPX, it has significantly worse compression for anything but JPEG and WAV (not to mention mt1 or mt2 in most cases), so again users would have to compress files separately and hand-pick which go into ZIPX and which go into 7z.
With PA, we added extension sort in the latest version and you can use a 2g window, so the result is 2645; it is easily repeatable…
We filter all these files automatically and apply over 15 different codecs and filters, so generally it will produce better results in most real-life cases than using a different format for each file. Realistically, you can’t compare that to using the 7z DLL with BCJ2 on all files. For people using the 7z format, the best bet is to use 7z.exe and not third-party tools, due to the file detection that 7z.exe uses; your result will never be as good.
No.
First, as I’ve mentioned before, DiskZIP is not based on 7z.exe or 7z.dll. We have our own proprietary wrapper around the 7-Zip code, probably similar to how PA does it - with the major difference that we do not actually change any compression code. The advantage of this approach is that we do enjoy all features that are inside 7-Zip as a “first class citizen” of the 7-Zip stack.
This includes, again as I have mentioned before, proper file sorting, automatic filter application, among others.
This is probably why you’re adamant the file I sent you must have been created using 7z.exe (or 7z.dll) instead of DiskZIP. If you still have any doubts, please pick the first command line override parameter (which configures for the highest compression) and reproduce the results for yourself. This will of course require a whole lot of memory, which is why we don’t include this setting by default in any of our out-of-the-box compression profiles.
As for your specific points:
Extension sort in DiskZIP is standard in all compression profiles, but can be turned off if the user so desires (just set File Sorting to false). This was actually a surprising regression in 7-Zip compression performance between the latest branches 16.x-17.x and the earlier 9.2x branches. Fortunately, we never passed this regression on to our customers.
Correct filter application is, again, automatic with the overridden command line - of course, if you use the GUI settings, the default profiles do lock you in to BCJ2, but you can change them to whatever you want (and it would be hardly unexpected to get the filter you configured at that point).
It would still have been nice had you been able to test our ZIPX format. Maybe the results would have surprised you?
Last but not least, the reason we did choose the BCJ2 filters as default in our profiles was because they performed best in our tests for all samples we came across, not just cherry-picked test cases. Of course, it is understandable that our median test cases and your median test cases would differ.
I hope that helps clear some of the confusion. At the end of the day, is PA still unable to meet or exceed DiskZIP’s compression on our sample data set (which is, again, a typical load for our environment, and not a hand-picked specially crafted set designed just to defeat PA - that would be quite hard to manually build, I’m sure!)
-
@diskzip again, the file that you provided seems to have been created with 7z.exe. A file created by DiskZIP does not have these filters applied and is 150 MB larger. There is no special switch in 7-Zip to make it use the Itanium filter, for instance; it is part of BCJ2 and is used when 7z.exe detects an appropriate executable. When you use 7z.exe to compress your folder, it will automatically apply the Itanium filter when it detects such an executable.
The same goes for the multimedia delta filter. DiskZIP does not use delta filters at all, as it needs 7z.exe to determine which files to use them on. As such, DiskZIP will always create different/worse results compared to 7-Zip.
Please let me know what switch to use in order to replicate the file you sent as an example of DiskZIP compression.
As to “your” ZIPX compression, it is actually WinZip ZIPX compression. You are aware that ZIPX is not solid compression and that it creates significantly larger files than 7z? When it comes to special codecs, the PA format has better codecs for text, JPEG, and MP3 compared to WinZip ZIPX (specifically what’s unique compared to WinRAR and 7-Zip), together with everything else that makes the PA format unique. It is considerably faster while having considerably better compression at the same time.
So yeah, we try WinZip’s ZIPX every time we add a new codec to the PA format and benchmark against it, as well as 7-Zip, WinRAR, FreeArc, and specialty compressors like precomp. Obviously we try to do something better, even if we do not succeed each time.
And since you obviously consider diskzip to be on par with other compression applications, why not submit it for testing at:
http://www.squeezechart.com/
That way you will have independent tests to back up those claims of first class citizen. The problem is that in the tests I do with DiskZIP, it creates larger files than 7-Zip, but I would love to be proven wrong. I am sure @Stephan will be helpful as usual.
-
Thank you for sticking around, we’ll get to the bottom of this.
Please see the step-by-step instructions I’ve posted here for configuring the compression string manually:
https://encode.ru/threads/130-PowerArchiver?p=53664&viewfull=1#post53664
It is easier to attach images to a post there, so please excuse me for having to redirect you there for instructions. There are precise instructions there though, so your detour is worth it.
For ZIPX, we use the BricolSoft Zip ActiveX control - this excellent library may be worth checking out for you folks as well. After all, nobody on this thread invented 7-Zip, and nobody here invented ZIPX either!
-
@diskzip said in Poor compression support:
https://encode.ru/threads/130-PowerArchiver?p=53664&viewfull=1#post53664
cool, will test it out with that setting sometime soon.
-
We have just released DiskZIP 2018, with the following enhancements:
https://encode.ru/threads/2763-DiskZIP
Thank you very much for the feedback you provided on this thread regarding the usage of CPU cores. This issue is now addressed by way of utilizing all available CPU cores (with the exception of the highest compression profiles, which limit to two CPU cores in order to ensure the best compression ratios).
Please also note, you would now need to choose the second command line override string instead of the first, because in DiskZIP 2018, the first command line override string again forces BCJ2 compression :)
I know, ironic! The customer is always right, and that’s what they wanted…
So please excuse the inconvenience caused by the inaccurate instructions - as you will see for yourself when you take a look at the custom compression strings though, the second one does engage 7-Zip’s automatic file detection filters.
I would like to thank you all once again, on behalf of the entire DiskZIP team, for your feedback - and we look forward to any further feedback you may have on the 2018 release. Wishing you all a great weekend!
-
@diskzip now we get PR releases on our forums, do you have a link to it so i can delete the huge post relevant mostly to your product?
-
@spwolf You are right, please excuse my enthusiasm. I have made the edit directly, please don’t hesitate to let me know if it is not satisfactory.
-
@spwolf Have you had a chance to test our native compression yet? Please note, as I described in my new release announcement post, to use an auto-switching filter, you now have to use the second command line override parameter (instead of the first as it was in the prior version). Thanks!
-
@diskzip what’s native compression, 7z?
Did not have a chance to try the new version yet, I might this week.
As to the PA format, our current record with simple/optimized setting is:
PA 17.01.03: 2,460,182 kB (d2G, mt1)
7zip max settings: 2,593,514 kB
Your sample: 2,600,325 kB
We adjusted extension groupings, added more extensions to the exe/dll group that our exe filters are applied to, and added unlimited-size blocks to get better compression on samples over 10GB.
I think we can do at least 20-30M better with current settings; not everything is perfectly grouped together yet for this sample.
So currently around 130M better than best 7z settings, 136M better than dzip.
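The size differences quoted in this post can be checked with a few lines of arithmetic. The sizes are the ones listed above; the only assumption is that the "M" figures are the kB differences divided by 1024 and rounded down.

```python
KB_PER_MB = 1024  # assumption: "M" in the post means kB / 1024

# Archive sizes in kB, as listed in the post.
sizes_kb = {
    "PA 17.01.03": 2_460_182,
    "7zip max settings": 2_593_514,
    "Your sample": 2_600_325,
}


def gap_mb(bigger, smaller):
    """Size difference between two results, in MB."""
    return (sizes_kb[bigger] - sizes_kb[smaller]) / KB_PER_MB


if __name__ == "__main__":
    print(f"7z vs PA:     {gap_mb('7zip max settings', 'PA 17.01.03'):.1f} MB")
    print(f"sample vs PA: {gap_mb('Your sample', 'PA 17.01.03'):.1f} MB")
```

Under that assumption the gaps come out at a little over 130 MB and a little under 137 MB, consistent with the 130M and 136M figures given.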
