unicode - Output difference after reading files saved with different encoding options in Python
I have a file containing a list of Unicode strings, saved with the UTF-8 encoding option. I also have an input file saved in plain ANSI. I read a directory path from the ANSI file, walk it with os.walk(), and try to match each file against the list (saved as UTF-8). The names never match, even when the file is present.
Later, as a sanity check with the single string "40m_Ãzµ´ú¸ÕÀÉ", I saved that particular string (from Notepad) in three different files with the encoding options ANSI, Unicode, and UTF-8, and wrote a Python script that prints:

print repr(string)
print string
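(For reference, a minimal Python 2 sketch of how each saved file might be read back for printing; the file names below are assumptions, not the actual ones used.)

# read the first line of each Notepad-saved file as raw bytes
# (file names 'ansi.txt', 'unicode.txt', 'utf8.txt' are assumed for illustration)
for fname in ('ansi.txt', 'unicode.txt', 'utf8.txt'):
    with open(fname, 'rb') as fp:
        string = fp.readline().rstrip('\r\n')
    print repr(string)
    print string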
The output looks like this:
ansi encoding:
'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
40m_Ãzµ´ú¸ÕÀÉ

unicode encoding:
'\x004\x000\x00m\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
 4 0 m _ Ã z µ ´ ú ¸ Õ À É

utf-8 encoding:
'40m_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
40m_Ãzµ´ú¸ÕÀÉ
I can't understand how to compare the same string when it comes from differently encoded files. Please help.
PS: I also have to handle typical Unicode names like 唐朝小栗子第集.mp3, which are difficult to deal with.
"I can't understand how to compare the same string when it comes from differently encoded files."
Notepad encoded your character string using three different encodings, resulting in three different byte sequences. To retrieve the character string, you must decode those bytes using the same encodings:
>>> ansi_bytes = '40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
>>> utf16_bytes = '4\x000\x00m\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
>>> utf8_bytes = '40m_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
>>> ansi_bytes.decode('mbcs')
u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40m_Ãzµ´ú¸ÕÀÉ
>>> utf16_bytes.decode('utf-16le')
u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40m_Ãzµ´ú¸ÕÀÉ
>>> utf8_bytes.decode('utf-8')
u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40m_Ãzµ´ú¸ÕÀÉ
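Once decoded, all three are the same unicode object, so the comparison that fails at the byte level succeeds at the character level (a quick sketch continuing the session above, on Windows where the mbcs codec is available):

>>> ansi_bytes.decode('mbcs') == utf16_bytes.decode('utf-16le') == utf8_bytes.decode('utf-8')
True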
'ANSI' (not "ASCII") is what Windows (somewhat misleadingly) calls its default locale-specific code page, which in your case is 1252 (Western European, available in Python as windows-1252), but this varies from machine to machine. You can get whatever this encoding is from Python on Windows using the name mbcs.

'Unicode' is the name Windows uses for the UTF-16LE encoding (very misleadingly, because Unicode is the character set standard and not any kind of bytes⇔characters encoding in itself). Unlike ANSI and UTF-8, this is not an ASCII-compatible encoding, so your attempt to read a line from the file has failed, because the line terminator in UTF-16LE is not \n but \n\x00. This has left a spurious \x00 at the start of the byte string you have above.

'UTF-8' is at least accurately named, but Windows likes to put fake 'byte order marks' at the front of "UTF-8" files, which give you an unwanted u'\ufeff' character when you decode them. If you want to accept "UTF-8" files saved from Notepad, you can manually remove this, or use Python's utf-8-sig encoding.
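Both pitfalls are easy to demonstrate in the interpreter (a minimal Python 2 sketch; the byte strings are constructed inline rather than read from real Notepad files):

>>> u'a\nb'.encode('utf-16le')             # UTF-16LE line terminator is '\n\x00'
'a\x00\n\x00b\x00'
>>> '\xef\xbb\xbfabc'.decode('utf-8')      # Notepad's BOM survives a plain utf-8 decode
u'\ufeffabc'
>>> '\xef\xbb\xbfabc'.decode('utf-8-sig')  # utf-8-sig strips it
u'abc'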
You can use codecs.open() instead of open() to read a file with automatic Unicode decoding. This also fixes the UTF-16 newline problem, because then the \n characters are detected after decoding instead of before.
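For example (a minimal sketch; the file name list_utf16.txt is an assumption, and the plain utf-16 codec is used so that the BOM Notepad writes is consumed automatically):

import codecs

# hypothetical file saved from Notepad with the 'Unicode' (UTF-16LE + BOM) option
with codecs.open('list_utf16.txt', 'rb', 'utf-16') as fp:
    for line in fp:                      # '\n' is found after decoding, so lines split correctly
        print repr(line.rstrip(u'\r\n'))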
"I read the directory path from the ANSI file and use os.walk()"
Windows filenames are natively handled as Unicode, so when you give Windows a byte string, it has to guess what encoding is needed to convert the bytes to characters. It chooses ANSI, not UTF-8. That would be fine if you were using a byte string from a file that was also encoded in the same machine's ANSI encoding, but then you would be limited to filenames that fit within your machine's locale. In a Western European locale 40m_Ãzµ´ú¸ÕÀÉ would fit but 唐朝小栗子第集.mp3 would not, so you wouldn't be able to refer to the Chinese files at all.
Python supports passing Unicode filenames directly to Windows, which avoids the problem (most other languages can't do this). Pass a unicode string into filesystem functions such as os.walk() and you will get unicode strings out, instead of failure.
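A quick way to see the difference (a minimal Python 2 sketch; u'.' stands in for your real directory path):

import os

# with a unicode argument, os.walk yields unicode names; with a byte-string
# argument, Windows would have to guess an encoding (ANSI) for the names
for dirpath, dirnames, filenames in os.walk(u'.'):
    for name in filenames:
        print type(name), repr(name)   # <type 'unicode'> u'...'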
So, with all your input files UTF-8-encoded, you could do something like:
import os
import codecs

with codecs.open(u'directory_path.txt', 'rb', 'utf-8-sig') as fp:
    directory_path = fp.readline().strip(u'\r\n')    # unicode dir path

good_names = set()
with codecs.open(u'filename_list.txt', 'rb', 'utf-8-sig') as fp:
    for line in fp:
        good_names.add(line.strip(u'\r\n'))          # set of unicode file names

for dirpath, dirnames, filenames in os.walk(directory_path):  # names will be unicode strings
    for filename in filenames:
        if filename in good_names:
            # do something with the file
            pass