Reading files with unknown encodings in Python
- Background
- Test file description
- Use the chardet module to detect encoding
- Encoding detection of small files
- Encoding detection of large files
- Combining encoding detection and content reading
- Reference Documents
Background
While developing a log-analysis feature, we needed to read files in several different encodings and parse their contents. The first problem to solve is how to detect each file's encoding.
Test file description
For convenience of demonstration, first create five test files, named after their encodings: utf8-file, utf8bom-file, gbk-file, utf16le-file, utf16be-file. All five files contain the same three lines; the third line, shown translated as 'One, two, three, four', is the Chinese text 一二三四 (without non-ASCII content, an encoding such as GBK could not be told apart from plain ASCII):
abcd
1234
One, two, three, four
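The five files can be generated with Python's built-in codecs. This is a sketch; the file names follow the article, and the content assumes the third line is the Chinese text 一二三四 ("one, two, three, four"):

```python
# Sketch: generate the five test files with Python's built-in codecs.
# Assumption: the third line is the Chinese text 一二三四 ("one, two,
# three, four"); with ASCII-only content the gbk-file would be
# byte-identical to the utf8-file and undetectable as GBK.
content = 'abcd\n1234\n一二三四'

encodings = {
    'utf8-file': 'utf-8',
    'utf8bom-file': 'utf-8-sig',   # UTF-8 with a byte-order mark
    'gbk-file': 'gbk',
    'utf16le-file': 'utf-16-le',   # note: writes no BOM
    'utf16be-file': 'utf-16-be',   # note: writes no BOM
}

for filename, encoding in encodings.items():
    with open(filename, 'w', encoding=encoding) as f:
        f.write(content)
```

Note that the 'utf-16-le' and 'utf-16-be' codecs write no byte-order mark; if the original files carried one, they were presumably written with the 'utf-16' codec or with codecs.BOM_UTF16_LE / codecs.BOM_UTF16_BE prepended manually.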
Use the chardet module to detect encoding
chardet is a module for encoding detection; it helps us identify which encoding an unknown sequence of bytes uses.
Encoding detection of small files
The detect function of the chardet module accepts a bytes object and returns a dictionary containing the detected encoding and a confidence value.
>>> import chardet
>>> with open('utf8-file', 'rb') as f:
...     result = chardet.detect(f.read())
...     print(result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
Encoding detection of large files
Some files are very large; reading the whole file into memory as above and only then detecting the encoding is inefficient. Instead, we can detect incrementally: feed the detector one line at a time, and stop as soon as it reaches its minimum confidence threshold. This usually reads far less data than the whole-file approach, which shortens detection time, and reading in chunks also keeps memory usage low.
>>> import chardet
>>> from chardet.universaldetector import UniversalDetector
>>> detector = UniversalDetector()
>>> with open('utf8-file', 'rb') as f:
...     for line in f:
...         detector.feed(line)
...         if detector.done:
...             break
...     detector.close()
...     print(detector.result)
...
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
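When scanning many files, the same UniversalDetector instance can be reused by calling its reset() method between files instead of constructing a new detector each time. A minimal sketch; the sample file names and content are illustrative, and the 'utf-16' codec is used here because (unlike 'utf-16-le') it prepends a BOM that chardet recognizes immediately:

```python
from chardet.universaldetector import UniversalDetector

# Create two sample files so the sketch is self-contained.
samples = {'sample-utf8': 'utf-8', 'sample-utf16': 'utf-16'}
for name, enc in samples.items():
    with open(name, 'w', encoding=enc) as f:
        f.write('abcd\n1234\n一二三四')

# Reuse one detector for all files, calling reset() between them.
detector = UniversalDetector()
results = {}
for name in samples:
    detector.reset()
    with open(name, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    results[name] = detector.result['encoding']

print(results)
```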
Combining encoding detection and content reading
We now wrap encoding detection and file reading into one function and test it against the five files. Note that the code below passes a LanguageFilter argument when creating the UniversalDetector object, which narrows the candidate encodings and makes the result more accurate.
>>> import io
>>> import chardet
>>> from chardet.universaldetector import UniversalDetector, LanguageFilter
>>> def reading_unknown_encoding_file(filename):
...     detector = UniversalDetector(LanguageFilter.CHINESE)
...     with open(filename, 'rb') as f:
...         for line in f:
...             detector.feed(line)
...             if detector.done:
...                 break
...         detector.close()
...         encoding = detector.result['encoding']
...         f = io.TextIOWrapper(f, encoding=encoding)
...         f.seek(0)
...         for line in f:
...             print(repr(line))
...
>>> reading_unknown_encoding_file('utf8-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('utf8bom-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('gbk-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('utf16le-file')
'abcd\n'
'1234\n'
'One, two, three, four'
>>> reading_unknown_encoding_file('utf16be-file')
'abcd\n'
'1234\n'
'One, two, three, four'
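One caveat: detector.result['encoding'] can be None when the input is too short or too ambiguous, and even a confidently detected encoding can still be wrong and raise UnicodeDecodeError. A defensive sketch; the UTF-8 default and the errors='replace' policy are choices made here, not part of chardet:

```python
def open_with_fallback(filename, detected_encoding):
    """Read a text file using a detected encoding, falling back to
    UTF-8 with replacement characters when detection failed or the
    detected encoding turns out to be wrong."""
    if detected_encoding is None:          # chardet could not decide
        detected_encoding = 'utf-8'
    try:
        with open(filename, encoding=detected_encoding) as f:
            return f.read()
    except (UnicodeDecodeError, LookupError):
        # Misdetected encoding or unknown codec name: decode
        # permissively, mapping invalid bytes to U+FFFD.
        with open(filename, encoding='utf-8', errors='replace') as f:
            return f.read()

# Example: a file containing a byte that is invalid in UTF-8.
with open('weird-file', 'wb') as f:
    f.write(b'abc\xff\n')
text = open_with_fallback('weird-file', None)
```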
Reference Documents
chardet documentation