Encoding of filenames in zip archives
-
Hi,
i’m looking for some technical advice about ZIP archives in general. i’ve developed a small utility that reads zip archives and extracts the names of the files within, in C++ and using the zlib version 1.2.1. since i’m working in a UTF-8 environment, i need some conversion, and i had always assumed (i can’t remember whether i read it somewhere or just found out) that the internal encoding for filenames was cp-850 (DOS codepage).
however, i’ve received an archive created by EasyZip. in it is a file whose name contains special chars (german umlauts), and i’ve noticed that the conversion didn’t work. further searching showed that in this EasyZip archive internal filenames are encoded using ISO-latin-1 (ISO-8859-1). some zip utilities seem not to have any problem reading the filename (eg winzip), some others (winrar, powerarchiver, …) just read it wrong – as if it were encoded with the cp-850 codepage.
i’ve searched the zlib documentation, and the zip specification on the pkware webpage, and i haven’t been able to find out where the encoding used might be specified. i guess it’s possible since some programs recognize it, but i’m at a loss as to where i should look for it… can someone help me with that, or give me a hint? thx a lot in advance
David
-
Can you post the archive here so we can check and test it? Hopefully a fix can be made.
-
Not really sure if this is an answer, but maybe this thread gives a hint?
http://www.powerarchiver.com/forums/showthread.php?t=1196
-
hi
first of all thx for the replies
the linked thread certainly has something to do with my problem. however, what i’m looking for is some technical hints (ie on the programming level) as to how to recognize the internal encoding to perform the proper conversion. i’m posting some example zip files which depict my problem.
the archives have all been created with different zip programs (as described in the filenames). easyzip.zip and winzip.zip both contain one file named ÄÖÜßäöü.txt (ÄÖÜäöüß). winxp.zip and ziputil.zip both contain one file named NeüerNäme.txt (Ne¨erNäme)
weirdly enough, different programs manage to open different archives:
winzip gets them all right
winrar gets winxp.zip wrong, but all the others right.
powerarchiver (9.25) and my program get winzip and winxp right, but the other ones wrong.
when i say “get them wrong”, it only means that the filenames are scrambled. for instance, when i extract easyzip.zip with powerarchiver, the extracted file is named -Í_¯õ÷³.txt (can’t write this with HTML codes… however, that’s exactly what you get when you interpret the characters in the filename as CP-850 although they’re ISO-8859-1).
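The scrambling described here can be reproduced in a few lines. This is only an illustrative sketch in Python (not the C++/zlib code of the utility discussed in this thread); the helper name `reinterpret` is made up for the example, and the codec names are Python's standard ones.

```python
def reinterpret(name: str, written_as: str, read_as: str) -> str:
    """Encode `name` the way the writer did, then decode it the way a reader assumes."""
    raw = name.encode(written_as)   # the bytes stored in the archive
    return raw.decode(read_as)      # what the reading tool displays

# Writer used ISO-8859-1, reader assumed CP-850: the name comes out scrambled.
scrambled = reinterpret("ÄÖÜßäöü.txt", "iso-8859-1", "cp850")

# Reader assumes the same encoding as the writer: the name survives.
intact = reinterpret("ÄÖÜßäöü.txt", "iso-8859-1", "iso-8859-1")
```

In `scrambled`, the characters Í, õ, ÷ and ³ land in the same positions as in the name quoted above; the remaining positions are CP-850 box-drawing characters, which is presumably why they were typed as -, _ and ¯ in the post.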
again, this is not a bug report against powerarchiver (although you could use it as one if you want ;-) ). my primary concern is to find out how i can recognize the encoding in my own program.
again thx for your help!
attachment_p_5670_0_easyzip.zip
attachment_p_5670_1_winxp.zip
attachment_p_5670_2_winzip.zip
attachment_p_5670_3_ziputil.zip
-
I think it is safe to assume that the encoding should be CP-850. The problem is that many zip engines are not built to the standard and use their own logic.
I briefly checked the app note and I didn’t see anything about code sets.
-
hmmm… not sure about this. does that mean that one should check which program was used to create the archive and react accordingly, ie to say “if program == winzip then encoding = cp850 else if program == easyzip then encoding = iso-8859-1 else …”?
i mean, since the last version of winzip (9.0 i think) seems to be able to cope with different encodings, how do they do it?
i also checked some docs, especially that of PKWare, and i think it doesn’t say anything about CP-850 either. they took it as a standard because it was the DOS encoding at the time they first created the app, and everyone followed them until apparently a few years ago, when others started to take their own path. i so want to know how to find out what that path is!
i haven’t tried with the last version (2006 b6) of powerarchiver, does it solve the problem? if not, shouldn’t it?
-
I think EasyZip (which is the previous version of PA, btw) and earlier versions of the ZipTV engine simply used whatever code set was active. This is wrong. There are a few other things EasyZip does that are outside the zip spec as well.
Maybe Ivan will chime in with more info on this; I am sure they fixed the problem in one of the EasyZip/PA releases 4-5 years ago.
You could always code some kind of code set check, or get it from somewhere else, but this is a solution for a problem that almost does not exist, since all of the newer zip engines are now using CP-850. I don’t think we have had many support requests for this issue (if any) in the past year or so, although it was more frequent maybe 4 years ago.
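One possible shape for such a "code set check", sketched in Python for brevity: decode the stored bytes under each candidate code page and keep the one under which the most non-ASCII bytes turn into letters. The candidate list and the names `letter_score` and `guess_encoding` are assumptions made for this example; nothing in the zip format prescribes this, and a short filename can easily fool it.

```python
import unicodedata

# Hypothetical shortlist: the two code pages discussed in this thread.
# Order matters: on a tie, max() keeps the first entry (CP-850).
CANDIDATES = ("cp850", "iso-8859-1")

def letter_score(raw: bytes, encoding: str) -> int:
    """Count the non-ASCII bytes that decode to a letter under `encoding`."""
    score = 0
    for byte in raw:
        if byte < 0x80:
            continue  # ASCII bytes mean the same thing in both code pages
        char = bytes([byte]).decode(encoding, errors="replace")
        if unicodedata.category(char).startswith("L"):
            score += 1
    return score

def guess_encoding(raw: bytes) -> str:
    """Pick the candidate under which most high bytes look like letters."""
    return max(CANDIDATES, key=lambda enc: letter_score(raw, enc))
```

The idea is that umlauts stored as ISO-8859-1 mostly become box-drawing or symbol characters when read as CP-850, while umlauts stored as CP-850 mostly become control characters when read as ISO-8859-1. Pure ASCII names score zero either way and fall back to the first candidate.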
-
… this is a solution for a problem that almost does not exist, since all of the newer zip engines are now using CP-850.
might be. however, i checked the last zip/unzip utilities i found on the info-zip homepage (http://www.info-zip.org/). the latest version of zip (feb. 2005) definitely uses ISO-8859-1, and powerarchiver definitely has a problem with it (check the attachment). perhaps the reason why this issue was less addressed lately was that the unzip utility (which i think is commonly used on linux) as well as winzip (which is more or less the standard on windows) have learned to cope with this discrepancy, although unzip still has problems with eg win xp’s zip archives, and not with winzip’s, though they both use CP-850 internally. probably winzip sets some flag which indicates the codeset, and win xp doesn’t. i’ll check unzip’s source code, maybe i can find something there.
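One field that could play the role of such a flag is "version made by" in the central directory, whose upper byte names the host system (0 for MS-DOS/FAT, 3 for Unix in the PKWARE app note). Later revisions of the app note also define general purpose bit 11 (0x800) to mark UTF-8 names. The sketch below, in Python with the standard `zipfile` module, reads both. The mapping from host system to code page is purely an assumption for illustration; whether WinZip or unzip actually decide this way is not established in this thread. The function names are made up for the example.

```python
import zipfile

# Host system codes from the "version made by" field (PKWARE app note).
HOST_FAT = 0    # MS-DOS and OS/2 (FAT file systems)
HOST_UNIX = 3

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8

def raw_name(info: zipfile.ZipInfo) -> bytes:
    """Recover the stored filename bytes from a ZipInfo.

    zipfile decodes names as CP437 by default unless the UTF-8 flag is set,
    so re-encoding with the same codec gives back the original bytes.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.orig_filename.encode("utf-8")
    return info.orig_filename.encode("cp437")

def guess_from_host(info: zipfile.ZipInfo) -> str:
    """ASSUMPTION: FAT hosts wrote CP-850, everything else ISO-8859-1."""
    if info.flag_bits & UTF8_FLAG:
        return "utf-8"
    if info.create_system == HOST_FAT:
        return "cp850"
    return "iso-8859-1"

def decoded_name(info: zipfile.ZipInfo) -> str:
    """Decode the stored bytes with the code page guessed from the header."""
    return raw_name(info).decode(guess_from_host(info))
```

For a real archive this would be called on each entry of `zipfile.ZipFile(path).infolist()`. An archive written on Windows with the active ANSI code page would presumably still carry host code 0 and defeat this guess, so a content-based check may be needed as a fallback.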
cheers,
-
i checked the source code of the unzip utility, and the solution is actually there. i can post the explanation and some code implementing it if some people are interested.
David