FIELD: physics.
SUBSTANCE: in a method of determining whether a text contains Chinese, Japanese or Korean characters, a document image is received. The received document image is binarized. Connected components are found in the binarized document image. Based on the connected components found, a set of fragments is detected and the document orientation is determined. A hypothesis about the language is formulated for each fragment in the set of fragments. A probability estimate is calculated for each language hypothesis. A subset of fragments having the highest probability estimates is selected from the set of fragments. The language hypothesis is verified for each fragment in the subset. The decision about the presence of Chinese, Japanese or Korean characters is made on the basis of, at least, the results of verifying the language hypotheses for the fragments of the selected subset.
EFFECT: increased accuracy in determining the presence of Chinese, Japanese or Korean characters in the text.
20 cl, 7 dwg
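The steps listed under SUBSTANCE can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not say how fragments are formed, how the probability estimate is computed, or how a hypothesis is verified. Here each connected component is treated as one fragment, orientation detection is omitted, and the names `binarize`, `connected_components`, `select_top_fragments`, `squareness` and `contains_cjk` are hypothetical. The `squareness` score is only a stand-in for a real probability estimate.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 for ink, 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def select_top_fragments(fragments, score, k):
    """Keep the k fragments with the highest probability estimate."""
    return sorted(fragments, key=score, reverse=True)[:k]


def squareness(box):
    """Assumed placeholder estimate: ratio of the short side to the long side."""
    top, left, bottom, right = box
    h, w = bottom - top + 1, right - left + 1
    return min(h, w) / max(h, w)


def contains_cjk(gray, score, verify, k=3, min_confirmed=1):
    """Decide from the top-k fragments; `verify` is a caller-supplied check."""
    fragments = connected_components(binarize(gray))  # assumption: 1 component = 1 fragment
    candidates = select_top_fragments(fragments, score, k)
    confirmed = sum(1 for box in candidates if verify(box))
    return confirmed >= min_confirmed
```

In a real system `verify` would run a recognizer on the fragment. Passing it in as a callable keeps the sketch independent of any particular OCR engine.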
Authors
Dates
Published: 2017-03-21
Filed: 2013-12-20