KEncodingDetector Class Reference
from PyKDE4.kdecore import *
Detailed Description
Provides encoding detection capabilities.
Searches for an encoding declaration inside the raw data (meta and XML tags). If it cannot find one, it falls back to heuristics for the specified language.
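As a rough illustration of what searching raw data for a declaration can look like, the sketch below scans the start of a document for an XML header or an HTML meta charset. It is plain standard-library Python, not KEncodingDetector's actual implementation; the function name `find_declared_encoding`, the regular expressions, and the 1024-byte scan window are assumptions made for this example.

```python
import re

# Patterns for the two declaration styles (illustrative, not exhaustive).
_XML_DECL = re.compile(
    br'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']', re.I)
_META_CHARSET = re.compile(
    br'<meta[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I)


def find_declared_encoding(raw, window=1024):
    """Return (encoding_name, source) or None if no declaration is found."""
    head = raw[:window]  # only the beginning of the document is scanned
    match = _XML_DECL.search(head)
    if match:
        return match.group(1).decode("ascii"), "xml"
    match = _META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii"), "meta"
    return None
```

When this returns None, a detector would have to fall back to language heuristics, which are outside the scope of this sketch.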
If it finds Unicode BOM marks, it changes the encoding regardless of what the user has specified.
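The BOM rule can be pictured with the following standard-library sketch, in which a byte order mark at the start of the data wins over a user-supplied encoding. The function name `pick_encoding` and the returned source labels are assumptions for illustration; the class itself may recognize a different set of BOMs.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(raw, user_encoding):
    """Return (encoding, source); a BOM overrides the user's choice."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, "BOM"
    return user_encoding, "UserChosenEncoding"
```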
Intended lifetime of the object: one instance per document.
Typical use:
QByteArray data;
...
KEncodingDetector detector;
detector.setAutoDetectLanguage(KEncodingDetector.Cyrillic);
QString out=detector.decode(data);
Do not mix decode() with decodeWithBuffering().
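The typical-use snippet is written with C++-style declarations. In Python the same steps might look like the sketch below. It assumes PyKDE4 and PyQt4 are installed and that decode() accepts a QByteArray, as the method list indicates; the helper name `decode_cyrillic` is hypothetical, and the import guard exists only so the file still loads on systems without the bindings.

```python
try:
    from PyQt4.QtCore import QByteArray
    from PyKDE4.kdecore import KEncodingDetector
    HAVE_PYKDE4 = True
except ImportError:  # bindings not installed; the helper then returns None
    HAVE_PYKDE4 = False


def decode_cyrillic(raw_bytes):
    """Decode raw bytes with a Cyrillic hint; None when PyKDE4 is missing."""
    if not HAVE_PYKDE4:
        return None
    detector = KEncodingDetector()  # intended lifetime: one per document
    detector.setAutoDetectLanguage(KEncodingDetector.Cyrillic)
    text = detector.decode(QByteArray(raw_bytes))
    return text, detector.encoding()
```

Only decode() is used here, in line with the advice not to mix it with decodeWithBuffering() on the same object.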
Guess encoding of char array
Enumerations | |
AutoDetectScript | { None, SemiautomaticDetection, Arabic, Baltic, CentralEuropean, ChineseSimplified, ChineseTraditional, Cyrillic, Greek, Hebrew, Japanese, Korean, NorthernSaami, SouthEasternEurope, Thai, Turkish, Unicode, WesternEuropean } |
EncodingChoiceSource | { DefaultEncoding, AutoDetectedEncoding, BOM, EncodingFromXMLHeader, EncodingFromMetaTag, EncodingFromHTTPHeader, UserChosenEncoding } |
Methods | |
__init__ (self) | |
__init__ (self, QTextCodec codec, KEncodingDetector.EncodingChoiceSource source, KEncodingDetector.AutoDetectScript script=KEncodingDetector.None) | |
__init__ (self, KEncodingDetector other) | |
bool | analyze (self, QString data, int len) |
KEncodingDetector.AutoDetectScript | autoDetectLanguage (self) |
QString | decode (self, QString data, int len) |
QString | decode (self, QByteArray data) |
QString | decodeWithBuffering (self, QString data, int len) |
bool | decodedInvalidCharacters (self) |
QTextDecoder | decoder (self) |
QString | encoding (self) |
KEncodingDetector.EncodingChoiceSource | encodingChoiceSource (self) |
bool | errorsIfUtf8 (self, QString data, int length) |
QString | flush (self) |
bool | processNull (self, QString data, int length) |
resetDecoder (self) | |
setAutoDetectLanguage (self, KEncodingDetector.AutoDetectScript a0) | |
bool | setEncoding (self, QString encoding, KEncodingDetector.EncodingChoiceSource type) |
bool | visuallyOrdered (self) |
Static Methods | |
bool | hasAutoDetectionForScript (KEncodingDetector.AutoDetectScript a0) |
QString | nameForScript (KEncodingDetector.AutoDetectScript a0) |
KEncodingDetector.AutoDetectScript | scriptForName (QString lang) |
Method Documentation
__init__ | ( | self ) |
The default codec is latin1 (as the HTML spec says), the EncodingChoiceSource is DefaultEncoding, and the AutoDetectScript is SemiautomaticDetection.
__init__ | ( | self, | ||
QTextCodec | codec, | |||
KEncodingDetector.EncodingChoiceSource | source, | |||
KEncodingDetector.AutoDetectScript | script=KEncodingDetector.None | |||
) |
Sets the default codec, the EncodingChoiceSource, and the AutoDetectScript.
__init__ | ( | self, | ||
KEncodingDetector | other | |||
) |
bool analyze | ( | self, | ||
QString | data, | |||
int | len | |||
) |
Analyze text data.
- Returns:
- true if there was enough data for accurate detection
KEncodingDetector.AutoDetectScript autoDetectLanguage | ( | self ) |
QString decode | ( | self, | ||
QString | data, | |||
int | len | |||
) |
The main class method.
Calls the protected analyze() only on the first call in the object's lifetime.
Replaces all null characters with spaces.
QString decode | ( | self, | ||
QByteArray | data | |||
) |
The main class method.
Calls the protected analyze() only on the first call in the object's lifetime.
Replaces all null characters with spaces.
QString decodeWithBuffering | ( | self, | ||
QString | data, | |||
int | len | |||
) |
Convenience method that uses buffering. It waits for the full HTML head to be buffered (i.e. it calls analyze() on every call until analyze() returns true).
Replaces all null characters with spaces.
- Returns:
- Decoded data, or empty string, if there was not enough data for accurate detection
- See also:
- flush()
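The buffering behavior described for decodeWithBuffering() and flush() can be sketched in plain Python as follows. This is an illustrative stand-in, not the class's implementation: the class name `BufferingDecoder`, the `</head>` check, the `min_bytes` threshold, and the fixed single-byte encoding are all assumptions made for the example.

```python
class BufferingDecoder:
    """Illustrative stand-in: buffer chunks until detection looks reliable."""

    def __init__(self, encoding="latin-1", min_bytes=1024):
        self.encoding = encoding    # assumed single-byte encoding
        self.min_bytes = min_bytes  # hypothetical "enough data" threshold
        self._buffer = b""
        self._ready = False

    def _analyze(self, data):
        # Assumption: enough data once </head> is seen or the buffer is
        # large enough. The real criterion may differ.
        return b"</head>" in data.lower() or len(data) >= self.min_bytes

    def _to_text(self, data):
        # Null bytes become spaces, as the documentation states.
        return data.replace(b"\x00", b" ").decode(self.encoding, "replace")

    def decode_with_buffering(self, chunk):
        if self._ready:
            return self._to_text(chunk)
        self._buffer += chunk
        if not self._analyze(self._buffer):
            return ""               # not enough data yet
        self._ready = True
        return self.flush()

    def flush(self):
        """Decode and return whatever is still buffered."""
        data, self._buffer = self._buffer, b""
        return self._to_text(data)
```

A caller that reaches the end of the input while the method is still returning empty strings would call flush() to obtain the buffered text.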
bool decodedInvalidCharacters | ( | self ) |
This method checks whether invalid characters were found during a decoding operation.
Note that this bit is never reset once invalid characters have been found. To force a reset, either change the encoding using setEncoding() or call resetDecoder().
- Returns:
- a boolean reflecting said state.
- Since:
- 4.3
- See also:
- resetDecoder() setEncoding()
QTextDecoder decoder | ( | self ) |
- Returns:
- QTextDecoder for detected encoding
QString encoding | ( | self ) |
Convenience method.
- Returns:
- MIME name of the detected encoding
KEncodingDetector.EncodingChoiceSource encodingChoiceSource | ( | self ) |
bool errorsIfUtf8 | ( | self, | ||
QString | data, | |||
int | length | |||
) |
Checks whether the data is really UTF-8. Taken from Kate.
- Returns:
- true if current encoding is utf8 and the text cannot be in this encoding
Please somebody read http://de.wikipedia.org/wiki/UTF-8 and check this code...
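For comparison, a strict UTF-8 validity check with the same return convention can be written in a few lines of standard-library Python. This is a sketch under assumptions, not the Kate-derived code: the function name `errors_if_utf8` and the `current_encoding` parameter are hypothetical, and a multi-byte sequence truncated at the end of the data also counts as an error here.

```python
def errors_if_utf8(data, current_encoding="utf-8"):
    """True if the current encoding is UTF-8 but data is not valid UTF-8."""
    if current_encoding.lower().replace("_", "-") not in ("utf-8", "utf8"):
        return False  # the check only applies when UTF-8 is assumed
    try:
        data.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return True   # bytes cannot be UTF-8
    return False
```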
QString flush | ( | self ) |
Convenience method to be used with decodeWithBuffering(). Flushes the buffer.
- See also:
- decodeWithBuffering()
bool processNull | ( | self, | ||
QString | data, | |||
int | length | |||
) |
Eliminates all 0 bytes (or double bytes) and remembers whether the data was binary.
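A minimal sketch of the idea, assuming (as the decode() documentation says) that null bytes are replaced with spaces and that the presence of any null byte is what gets remembered. The function name `process_null` and the tuple return value are choices made for this example only.

```python
def process_null(data):
    """Return (cleaned_bytes, had_nulls); each NUL byte becomes a space."""
    had_nulls = b"\x00" in data  # hint that the input may be binary
    return data.replace(b"\x00", b" "), had_nulls
```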
resetDecoder | ( | self ) |
Resets the decoder. Any stateful decoding information (such as that resulting from previous calls to decodeWithBuffering()) will be lost. Also resets the state of decodedInvalidCharacters() as a side effect.
- Since:
- 4.3
- See also:
- decodeWithBuffering() decodedInvalidCharacters()
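To show what stateful decoding information and a sticky invalid-character flag mean in practice, here is a standard-library analogue built on Python's incremental decoders. The class name `StickyDecoder` is hypothetical, and detecting invalid input through the U+FFFD replacement character is a simplification: a legitimate U+FFFD in the input would also set the flag.

```python
import codecs


class StickyDecoder:
    """Incremental decoder with a flag that stays set until reset()."""

    def __init__(self, encoding="utf-8"):
        self.invalid_seen = False  # sticky until reset()
        self._decoder = codecs.getincrementaldecoder(encoding)(errors="replace")

    def decode(self, chunk):
        # Incomplete multi-byte sequences are held until the next chunk.
        text = self._decoder.decode(chunk)
        if "\ufffd" in text:       # replacement character marks bad input
            self.invalid_seen = True
        return text

    def reset(self):
        """Drop pending partial sequences and clear the invalid flag."""
        self._decoder.reset()
        self.invalid_seen = False
```

After reset(), a continuation byte that belonged to a sequence started before the reset no longer decodes correctly, which is the loss of state the description warns about.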
setAutoDetectLanguage | ( | self, | ||
KEncodingDetector.AutoDetectScript | a0 | |||
) |
bool setEncoding | ( | self, | ||
QString | encoding, | |||
KEncodingDetector.EncodingChoiceSource | type | |||
) |
- Returns:
- true if specified encoding was recognized
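The "recognized" check has a close analogue in Python's own codec registry, sketched below. The names accepted by `codecs.lookup` are not the same set that Qt's text codecs accept, so this only illustrates the pattern; `recognize_encoding` is a hypothetical helper name.

```python
import codecs


def recognize_encoding(name):
    """Return the canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None
```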
bool visuallyOrdered | ( | self ) |
Static Method Documentation
bool hasAutoDetectionForScript | ( | KEncodingDetector.AutoDetectScript | a0 | |
) |
QString nameForScript | ( | KEncodingDetector.AutoDetectScript | a0 | |
) |
KEncodingDetector.AutoDetectScript scriptForName | ( | QString | lang | |
) |
Takes the language name _after_ it has been passed through i18n().
Enumeration Documentation
AutoDetectScript |
- Enumerator:
-
None SemiautomaticDetection Arabic Baltic CentralEuropean ChineseSimplified ChineseTraditional Cyrillic Greek Hebrew Japanese Korean NorthernSaami SouthEasternEurope Thai Turkish Unicode WesternEuropean
EncodingChoiceSource |
- Enumerator:
-
DefaultEncoding AutoDetectedEncoding BOM EncodingFromXMLHeader EncodingFromMetaTag EncodingFromHTTPHeader UserChosenEncoding