Reading Files with Unknown Encodings in Python

    • Background
    • Test file description
    • Use the chardet module to detect encoding
      • Encoding detection of small files
      • Encoding detection of large files
    • Combining encoding detection and reading content
    • Reference Documents

Background

When developing a log analysis feature, you need to read files in different encodings and parse their contents. The first problem to solve is how to detect the encoding.

Test file description

For demonstration purposes, first create five test files, named after their encodings: utf8-file, utf8bom-file, gbk-file, utf16le-file, utf16be-file. All five files contain the same content:

abcd
1234
One, two, three, four
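The article does not show how these files were created. A minimal sketch that writes the same text in five encodings might look like the following (the third line is kept as the original Chinese text 一二三四, since ASCII-only content gives chardet little to distinguish between encodings):

```python
# Sketch: create the five test files used in this article.
content = 'abcd\n1234\n一二三四'   # third line is the original Chinese text

encodings = {
    'utf8-file': 'utf-8',
    'utf8bom-file': 'utf-8-sig',   # UTF-8 with a byte-order mark
    'gbk-file': 'gbk',
    'utf16le-file': 'utf-16-le',
    'utf16be-file': 'utf-16-be',
}

for filename, enc in encodings.items():
    with open(filename, 'w', encoding=enc) as f:
        f.write(content)
```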

Use the chardet module to detect encoding

chardet is a module for encoding detection; it helps us identify which encoding an unknown sequence of bytes belongs to.

Encoding detection of small files

The chardet module's detect function accepts a bytes object (not a str) and returns a dictionary containing the detected encoding and a confidence value.

>>> import chardet
>>> with open('utf8-file', 'rb') as f:
...     result = chardet.detect(f.read())
...     print(result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
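The result should not be trusted blindly: detection can fail ('encoding' is None for empty or unrecognizable input) and the confidence may be low. A hedged sketch of decoding with the detected encoding, using an arbitrary 0.8 threshold of my own choosing rather than any chardet recommendation:

```python
import chardet

raw = 'abcd\n1234\n一二三四'.encode('utf-8')   # sample bytes standing in for file content
result = chardet.detect(raw)

# 'encoding' may be None and 'confidence' may be low; 0.8 is an
# arbitrary threshold, not a chardet recommendation.
if result['encoding'] is not None and result['confidence'] >= 0.8:
    text = raw.decode(result['encoding'])
    print(text)
else:
    print('encoding could not be detected reliably:', result)
```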

Encoding detection of large files

Some files are very large. Reading the whole file, as above, before judging its encoding would be inefficient, so incremental detection is used instead: we feed the detector one line at a time, and once the detector reaches its minimum confidence threshold the result is available. This usually reads much less content than the previous method, shortening detection time; reading the file in chunks also avoids putting pressure on memory.

>>> import chardet
>>> from chardet.universaldetector import UniversalDetector
>>> detector = UniversalDetector()
>>> with open('utf8-file', 'rb') as f:
...     for line in f:
...         detector.feed(line)
...         if detector.done:
...             break
...     detector.close()
...     print(detector.result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
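When many files need to be checked, the same UniversalDetector can be reused by calling reset() between inputs instead of constructing a new object each time. A sketch with in-memory sample bytes standing in for file contents:

```python
from chardet.universaldetector import UniversalDetector

# Sample byte strings standing in for file contents.
samples = {
    'utf-8 sample': ('abcd\n1234\n一二三四\n' * 20).encode('utf-8'),
    'utf-16 sample': ('abcd\n1234\n一二三四\n' * 20).encode('utf-16'),
}

detector = UniversalDetector()
results = {}
for name, data in samples.items():
    detector.reset()              # clear state left over from the previous input
    detector.feed(data)
    detector.close()              # finalize before reading .result
    results[name] = detector.result
    print(name, detector.result)
```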

Combining encoding detection and reading content

We now wrap encoding detection and content reading into a single function and test it against the five encoding formats. Note that a LanguageFilter is passed when creating the UniversalDetector object; this narrows the candidate encodings and can make detection more accurate.

>>> import io
>>> import chardet
>>> from chardet.universaldetector import UniversalDetector, LanguageFilter
>>> def reading_unknown_encoding_file(filename):
...     detector = UniversalDetector(LanguageFilter.CHINESE)
...     with open(filename, 'rb') as f:
...         for line in f:
...             detector.feed(line)
...             if detector.done:
...                 break
...         detector.close()
...         encoding = detector.result['encoding']
...         f = io.TextIOWrapper(f, encoding=encoding)
...         f.seek(0)
...         for line in f:
...             print(repr(line))
...
>>> reading_unknown_encoding_file('utf8-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('utf8bom-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('gbk-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('utf16le-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('utf16be-file')
'abcd\n'
'1234\n'
'One, two, three, four'
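One caveat: if detection fails, the function above passes encoding=None to io.TextIOWrapper, which silently falls back to the locale's preferred encoding. If that behavior is unwanted, the detection step can be factored out with an explicit fallback. A sketch (detect_encoding and the fallback parameter are my own naming, not from this article or the chardet API):

```python
from chardet.universaldetector import UniversalDetector

def detect_encoding(filename, fallback='utf-8'):
    """Detect filename's encoding incrementally; return fallback if detection fails."""
    detector = UniversalDetector()
    with open(filename, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:          # stop once the confidence threshold is reached
                break
    detector.close()                   # finalize before reading .result
    return detector.result['encoding'] or fallback
```

The detected (or fallback) encoding can then be passed directly to open(filename, encoding=...).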

Reference Documents

chardet documentation