unicode - Output difference after reading files saved in different encoding option in python -


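I have a file containing a list of Unicode strings, saved with the UTF-8 encoding option. I have another input file, saved as plain ANSI. I read a directory path from the ANSI file, run os.walk() on it, and try to check whether each file found is present in the list (the one saved as UTF-8). It does not match, even when the file is present.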

Later I did a simple check with a single string, "40m_Ãz­µ´ú¸ÕÀÉ". I saved this particular string (from Notepad) in 3 different files, with the encoding options ANSI, Unicode and UTF-8. Then I wrote a Python script to print:

    print repr(string)
    print string

and the output looks like this:

ANSI encoding:

    '40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
    40m_Ãz­µ´ú¸ÕÀÉ

Unicode encoding:

    '\x004\x000\x00m\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
     4 0 m _ Ã z ­µ ´ ú ¸ Õ À É

UTF-8 encoding:

    '40m_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'
    40m_Ãz­µ´ú¸ÕÀÉ

I can't understand how to compare the same string when it comes from differently encoded files. Please help.

PS: I also have some typical Unicode characters like 唐朝小栗子第集.mp3, which are difficult to handle.

I can't understand how to compare the same string when it comes from differently encoded files.

Notepad encoded your character string in 3 different encodings, resulting in 3 different byte sequences. To retrieve the character string you must decode those bytes using the same encodings:

    >>> ansi_bytes  = '40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
    >>> utf16_bytes = '4\x000\x00m\x00_\x00\xc3\x00z\x00\xad\x00\xb5\x00\xb4\x00\xfa\x00\xb8\x00\xd5\x00\xc0\x00\xc9\x00'
    >>> utf8_bytes  = '40m_\xc3\x83z\xc2\xad\xc2\xb5\xc2\xb4\xc3\xba\xc2\xb8\xc3\x95\xc3\x80\xc3\x89'

    >>> ansi_bytes.decode('mbcs')
    u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40m_Ãz­µ´ú¸ÕÀÉ
    >>> utf16_bytes.decode('utf-16le')
    u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40m_Ãz­µ´ú¸ÕÀÉ
    >>> utf8_bytes.decode('utf-8')
    u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9' # 40m_Ãz­µ´ú¸ÕÀÉ

  • ‘ANSI’ (not “ASCI”) is what Windows (somewhat misleadingly) calls its default locale-specific code page, which in your case is 1252 (Western European, which you can get in Python as windows-1252) but will vary from machine to machine. You can get whatever this encoding is from Python on Windows using the name mbcs.

  • ‘Unicode’ is the name Windows uses for the UTF-16LE encoding (very misleadingly, because Unicode is the character set standard and not any kind of bytes⇔characters encoding in itself). Unlike ANSI and UTF-8 it is not an ASCII-compatible encoding, so your attempt to read a line from the file has failed because the line terminator in UTF-16LE is not \n, but \n\x00. This has left a spurious \x00 at the start of the byte string you have above.

  • ‘UTF-8’ is at least accurately named, but Windows likes to put fake Byte Order Marks at the front of its “UTF-8” files, which will give you an unwanted u'\ufeff' character when you decode them. If you want to accept “UTF-8” files saved from Notepad you can manually remove this or use Python's utf-8-sig encoding.
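The three points above can be reproduced without Notepad. The following is only a sketch, written to run on Python 3 (the thread itself uses Python 2), and it assumes that the machine's ANSI code page is cp1252 and that the files carry the byte order marks Notepad normally writes. The variable names are illustrative.

```python
import codecs

text = u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'

# Bytes roughly as Notepad would save them (assumption: ANSI means cp1252 here).
ansi_bytes = text.encode('cp1252')
utf16_bytes = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
utf8_bytes = codecs.BOM_UTF8 + text.encode('utf-8')

# Plain 'utf-8' keeps the BOM as a leading U+FEFF; 'utf-8-sig' strips it.
with_bom = utf8_bytes.decode('utf-8')
without_bom = utf8_bytes.decode('utf-8-sig')

# The 'utf-16' codec consumes the BOM and uses it to pick the byte order.
from_utf16 = utf16_bytes.decode('utf-16')
from_ansi = ansi_bytes.decode('cp1252')

print(with_bom == u'\ufeff' + text)
print(without_bom == from_utf16 == from_ansi == text)

# Splitting UTF-16LE bytes on the single byte b'\n' cuts the two-byte
# line terminator in half, so the next piece starts with a stray b'\x00'.
pieces = u'first\nsecond'.encode('utf-16-le').split(b'\n')
print(pieces[1][:1] == b'\x00')
```

Once every byte string has been decoded with the codec it was written in, the results are ordinary Unicode strings and compare equal.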

You can use codecs.open() instead of open() to read a file with automatic Unicode decoding. This also fixes the UTF-16 newline problem, because the \n characters are then detected after decoding instead of before.
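As a rough illustration of reading with codecs.open(), the sketch below writes the same string to three temporary files and reads each one back with a matching codec. It is written for Python 3. The file names are hypothetical, and cp1252 again stands in for whatever the local ANSI code page is.

```python
import codecs
import os
import shutil
import tempfile

text = u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'

# (hypothetical file name, codec used both to write and to read back)
cases = [
    ('ansi.txt', 'cp1252'),        # assumption: ANSI is cp1252 on this machine
    ('unicode.txt', 'utf-16'),     # writes a BOM, reads it back to find byte order
    ('utf8.txt', 'utf-8-sig'),     # writes a BOM, strips it when reading
]

tmp_dir = tempfile.mkdtemp()
decoded = []
try:
    for name, codec in cases:
        path = os.path.join(tmp_dir, name)
        with codecs.open(path, 'w', codec) as fp:
            fp.write(text + u'\r\n')
        # Lines are split after decoding, so UTF-16 newlines are handled correctly.
        with codecs.open(path, 'r', codec) as fp:
            decoded.append(fp.readline().strip(u'\r\n'))
finally:
    shutil.rmtree(tmp_dir)

print(all(line == text for line in decoded))
```

Each file holds different bytes on disk, but all three reads yield the same Unicode string, which is what makes a later comparison possible.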

I read the directory path from the asci file, and then os.walk()

Windows filenames are natively handled as Unicode, so when you give Windows a byte string it has to guess which encoding is needed to convert those bytes into characters. It chooses ANSI, not UTF-8. That would be fine if you were using a byte string from a file encoded in the same machine's ANSI encoding, but in that case you would be limited to filenames that fit within your machine's locale. In Western European, 40m_Ãz­µ´ú¸ÕÀÉ would fit but 唐朝小栗子第集.mp3 would not, so you wouldn't be able to refer to the Chinese files at all.
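The locale limitation can be checked without touching the filesystem. This is a small Python 3 sketch: fits_code_page is a hypothetical helper, and cp1252 is assumed as the machine's ANSI code page.

```python
western = u'40m_\xc3z\xad\xb5\xb4\xfa\xb8\xd5\xc0\xc9'
chinese = u'唐朝小栗子第集.mp3'


def fits_code_page(name, code_page='cp1252'):
    """Return True if every character of name can be encoded in code_page."""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


print(fits_code_page(western))
print(fits_code_page(chinese))
```

A name that cannot be encoded in the ANSI code page has no byte-string form that Windows would map back to the right file, which is why Unicode strings are the safer input.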

Python supports passing Unicode filenames directly to Windows, which avoids the problem (most other languages can't do this). Pass a Unicode string into filesystem functions like os.walk() and you should get Unicode strings out, instead of failure.

So, for UTF-8-encoded input files, something like:

    import codecs
    import os

    # 'utf-8-sig' drops the BOM that Notepad puts at the start of the file
    with codecs.open(u'directory_path.txt', 'rb', 'utf-8-sig') as fp:
        directory_path = fp.readline().strip(u'\r\n') # unicode dir path

    good_names = set()
    with codecs.open(u'filename_list.txt', 'rb', 'utf-8-sig') as fp:
        for line in fp:
            good_names.add(line.strip(u'\r\n')) # set of unicode file names

    for dirpath, dirnames, filenames in os.walk(directory_path): # names will be unicode strings
        for filename in filenames:
            if filename in good_names:
                pass # do something with the file
