Hash-based deduplication, sorting similar content together, and other ways to help rep-fma


  • Alpha Testers

PA needs a huge push in the deduplication department.

    Further improvements to rep-fma will help, but given that rar, for example, can completely destroy PA on some data sets, it would be a good idea to consider other options.

    Hash-based deduplication, like rar5 offers, could be a very good way to ensure no complete duplicate file is stored twice, with almost complete disregard for dictionary sizes or how far apart the duplicates are.
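    For illustration, a minimal sketch of the idea in Python (my own sketch, not PA's or rar5's actual implementation): hash every file's contents, and any files sharing a digest are byte-identical, so only one copy ever needs to be stored, no matter how far apart they sit in the archive order.

    ```python
    import hashlib
    import os

    def find_duplicate_files(root):
        """Group files under `root` by content hash; files sharing a digest
        are byte-identical and would need to be stored only once."""
        by_hash = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    # Hash in chunks so huge files don't need to fit in memory.
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                by_hash.setdefault(h.hexdigest(), []).append(path)
        # Keep only digests that occur more than once: the true duplicates.
        return {d: paths for d, paths in by_hash.items() if len(paths) > 1}
    ```

    Note this catches whole-file duplicates only; partial overlaps between files are exactly what rep-fma is for.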

    Sorting the data before storing it could also be a good idea: compress all the files with the same extension sequentially, given that they are more likely to contain similar data.
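    The extension-grouping idea is trivial to sketch (a hypothetical helper, not PA's code): reorder the input list so files with the same extension are compressed back to back, keeping similar data inside the compressor's window.

    ```python
    import os

    def order_by_extension(paths):
        """Return `paths` reordered so files sharing an extension are adjacent.
        A secondary sort by path keeps the ordering deterministic."""
        return sorted(paths, key=lambda p: (os.path.splitext(p)[1].lower(), p))
    ```

    For example, `order_by_extension(["a.jpg", "b.txt", "c.jpg"])` puts the two jpgs next to each other before the txt file.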

    These kinds of optimizations are, all in all, relatively cheap in terms of run time, but they can help deduplication dramatically in the cases where plzma4 and rep-fma are failing.


  • conexware

    @tokiwarthoot do you have any info on what kind of dedup rar5 uses? Can't see it in practice here. Ah, found the news section; it has a special switch.

    As for fma-rep, I assume the problem is that we currently have a limit of 2GB. So anything over that 2GB limit will not get properly deduplicated.

    Otherwise, fma-rep is not file-based but data-based: similarities inside that max 2GB window are used to lower the overall size, so it has far wider application than just file-based dedup… and it works really well in those cases where plzma4 memory usage would be too much or where zstd2 is simply not good enough. I think it is almost the best part of the current .pa; while reflate is way more complicated and significant, fma-rep applies to every set of data.

    @eugene is working on rep2, which brings not only unlimited size but also increased speed; maybe he can fill us in with the latest news :).


  • conexware

    Hi.

    PA needs a huge push in the deduplication department.

    Maybe, but rep1 is probably underused atm.
    At least for archives under 2G in size it would remove all the dups… if you’d let it have the memory.
    The main problem is actually its processing speed (~100 MB/s), not dedup per se.

    Further improvements to rep-fma will help, but given that rar, for example,
    can completely destroy PA on some data sets, it would be a good idea to
    consider other options.

    Again, it's probably a matter of having the right codec/parameter profiles, rather than
    actual improvements. We could look into it if you can provide sample archives.

    Well, rar actually has a wav audio filter, which we don't have yet,
    but otherwise our format should provide better compression at least,
    aside from some corner cases.

    Hash-based deduplication, like rar5 offers, could be a very good way to
    ensure no complete duplicate file is stored twice, with almost complete
    disregard for dictionary sizes or how far apart the duplicates are.

    I agree, but unfortunately the current format framework for .pa doesn't allow this.
    However, we're aware of the idea, and it will be implemented after the format upgrade,
    which is planned anyway.

    Sorting the data before storing it could also be a good idea: compress all
    the files with the same extension sequentially,

    Extension sorting is already used in PA, though.


  • Alpha Testers

    Ah, I didn’t realize extension sorting was already used; I assumed from some results I got that it wasn’t.
    It must be that, even then, the files were still too far apart for the plzma dictionary or the rep-fma memory on the Extreme preset to catch the duplication.

    I had already seen problems (or rather, underwhelming results) with deduplication in PA, at least with the current preset settings, but yesterday I found one that was pretty big.

    I was testing the new jpeg codec, so I searched for jpeg files on my PC and found that the League of Legends game client has 1.7GB of jpeg files right there, some big, some small… so it seemed reasonable to try that varied collection of jpg files.

    I copied them all to a folder and compressed them with different programs, including, of course, PA on StrongOptimize with the new jpeg compression.

    rar finished in about a minute and gave me an 850MB file.
    rar5 finished in about 3 minutes and gave me an 817MB file.
    7z ultra finished in about 4 minutes and gave me a 1.65GB file.
    PA plzma4 on the Extreme preset finished in about 6 minutes and gave me a 1.62GB file.
    PA StrongOptimize with the jpg codec, Extreme preset, finished in about 10 minutes and gave me a 1.45GB file. (So the jpg compression is great, but sadly, in this dataset, deduplication happened to be more important.)

    It turns out the game has most of those files twice.

    In general, for any data set with duplication farther apart than what fma-rep can catch with the memory you allow it, PA will store that data again. The problem is that, with the preset settings as they are right now, the data doesn’t need to be very far apart for this to happen.
    I think the higher presets should give rep more memory to catch more duplicated data.
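    A toy model of this window limitation (my own, not fma-rep’s actual algorithm) makes the point concrete: a duplicate is only caught if it recurs within the window, so anything farther back gets stored again.

    ```python
    def window_dedup_hits(chunks, window):
        """Count duplicate chunks a window-limited deduper catches: a chunk is
        deduplicated only when an identical chunk occurred within the last
        `window` positions; anything farther back is stored again."""
        last_seen = {}
        hits = 0
        for i, chunk in enumerate(chunks):
            if chunk in last_seen and i - last_seen[chunk] <= window:
                hits += 1
            last_seen[chunk] = i  # remember the most recent occurrence
        return hits
    ```

    With chunks `["A", "B", "C", "A"]`, a window of 3 catches the repeated "A", while a window of 2 misses it and the duplicate is stored twice. Giving rep more memory is the equivalent of widening the window here.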

    Also, compressing backups is a very common scenario where this comes up.
    If the backups are made by hand, just copying and pasting to a dated folder on an external drive, or by a simple script that zips the files and copies them to a server, or in any other non-incremental, non-compressed way, a lot of duplication is to be expected.
    Let’s say I’ve been doing this for 4 years now and want to compress the first 3 years to save space.
    Proper deduplication is key for something like this.

    And this is a scenario I’ve actually encountered many, many times working in IT for SMEs.

    In any case, I just wanted you guys to know about this weakness of the format, at least with the current preset settings.
    Hash-based deduplication would be great, but if it has to wait for a format upgrade, it’s not an option for now.
    Rep2 sounds great, but Eugene is still working on it, so it’s also not an option right now.

    So it seems it will be like this for a while.
    The only other option would be changing the preset settings; maybe lowering the plzma4 settings of the higher presets a little to give more memory to rep-fma would give more consistent results.
    But of course, this would compromise the compression ratio for data sets without much duplication, so it may not be worth it.


  • conexware

    @tokiwarthoot that explains it. Since fma-rep operates on data and not files, it cannot be used together with the jpeg codec: it would make the jpegs unrecognizable to that codec. This only happens with jpegs, which is why you saw it with jpegs.

    In this specific case, it would be better to turn off the jpeg codec, at which point fma-rep would work on these files.

    So you would use zstd2 codec at Maximum setting, and then turn on fma-rep in Advanced Options. You should get something like this in the options window:
    c_out=zstd2:x10:c1M:mt1:d24; rep1:fb1000:c300M:mem2000M { c_out }

    You can change the chunk size to 30M or so manually in the options as well. This manual setting is not debugged properly yet.

    What do you get then? Make sure you check the log to see what was used on the set.

    Thanks!


  • Alpha Testers

    Okay, I did what you told me to do, and the custom compression parameters show:

    c_out=zstd2:x10:d24:mt1;rep1:fb1000:c300M:mem2000M { c_out }

    (Some things need to be fixed in that UI: enabling rep-fma automatically changes the compression method to plzma4, for example, and changing rep’s dictionary size is also ignored, or at least it’s not shown in the custom compression parameters. But that’s a tangent.)

    The file is still 1.64GB.

    But the debug log is actually interesting:
    "
    64-bit version
    Filename: C:\Users\Administrator\Desktop\League of Legends jpgs\League of Legends jpgs.pa
    c_out=zstd2:x10:d24:mt1;rep1:fb1000:c50M:mem300M { c_out }

    -mf=off
    -m0=rep1:fb1000:c50M:mem300M
    -m1=zstd2:x10:d24:mt1
    -mb00s0:1=
    Start time: 30/01/2017 14:54:47
    End time: 30/01/2017 14:55:40
    "
    Apparently, in spite of the settings actually selected in the UI, rep is running with a much smaller dictionary size.


  • conexware

    @tokiwarthoot try manually editing it to 2000 again. It should set it properly after that. Also, the mt setting is set to mt1 by error there, so you can manually change it to mt6.

    The whole Advanced tab never got debugged properly yet; maybe tomorrow we will have time for that. Due to the correlation between filters and codec options, it gets quite complicated.

    The log always shows what’s happening and what’s wrong in the GUI right now, so it is very valuable for our understanding.

    Also, we don’t have an fb setting in the advanced options for fma-rep yet; that setting defines how matches are found. I have to see what would be a good example to use here for jpegs.

    Thanks!


  • Alpha Testers

    @spwolf setting 2000MB on the dictionary size manually did make it work properly and deduplicate as it should.

    It’s a shame, though, that we must choose between the jpg compression and rep for jpg files.
    If applying the lepton codec is a deterministic transformation, it should be possible to compress the jpgs first and then feed them into rep: identical jpgs would produce identical data after the lepton codec, and would then be deduplicated by rep.
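    The reasoning can be demonstrated with any deterministic per-file transform; here zlib stands in for the lepton codec (an assumption purely for illustration): identical inputs yield identical transformed outputs, so a downstream hash check still finds the duplicates.

    ```python
    import hashlib
    import zlib

    def transform_then_digest(blobs, level=9):
        """Apply a deterministic per-file transform (zlib here, standing in for
        a codec like lepton) and return the digest of each transformed result.
        Equal inputs always map to equal outputs, so duplicates survive the
        transform and remain detectable by hash."""
        return [hashlib.sha256(zlib.compress(b, level)).hexdigest() for b in blobs]

    a = b"same jpeg bytes" * 1000
    b = b"different bytes" * 1000
    d1, d2, d3 = transform_then_digest([a, a, b])
    assert d1 == d2  # duplicates still collide after the transform
    assert d1 != d3  # distinct files stay distinct
    ```

    Whether this holds for the real lepton codec inside PA’s pipeline depends on it being bit-exact deterministic, which only the devs can confirm.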


  • conexware

    @tokiwarthoot @eugene would know more about why exactly it does not work that way.

    The real benefit of fma-rep is that it works across the whole data set but uses a lot less memory. For instance, if we quickly compress the MS Office 2016 iso (conveniently picked to be around 2GB), we can see the real benefit of fma-rep over lz coders with a 64M dictionary:

    [screenshot: compression results for the MS Office 2016 iso]

    Most of the difference is due solely to fma-rep, and some is due to lzmarec. So the improvements in the new fma-rep2 will help not only speed but also these kinds of backups.



 
