FIELD: information technology.
SUBSTANCE: in a method for recognising textual information and evaluating its integrity in Internet electronic documents, an electronic document is split into areas presumed to contain text paragraphs and lines. Splitting continues until the largest areas containing continuous, logically bounded text are obtained. Redundant and superfluous information is deleted. The validity of the symbol encoding is analysed by checking whether letters belong to the alphabet and whether text words belong to the vocabulary of the given language. Statistical characteristics of word classes and their forms are calculated. From the obtained values a working vocabulary attribute vector is generated, which is converted into a principal components vector using principal component analysis procedures and classified using previously trained classifiers. Textual information integrity is evaluated using a voting method of decision making.
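Two of the steps above, the alphabet and vocabulary check and the voting decision, can be illustrated with a minimal Python sketch. It is not the claimed implementation: the toy `ALPHABET` and `VOCABULARY`, the function names `encoding_scores` and `majority_vote`, the labels, and the use of a simple majority as the voting rule are all assumptions made for illustration. The principal component conversion and the trained classifiers are omitted.

```python
import re
from collections import Counter

# Hypothetical toy alphabet and vocabulary for one language (assumptions).
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")
VOCABULARY = {"the", "text", "is", "valid", "document", "electronic"}


def encoding_scores(text, alphabet=ALPHABET, vocabulary=VOCABULARY):
    """Return (share of letters in the alphabet, share of words in the vocabulary)."""
    lowered = text.lower()
    letters = [ch for ch in lowered if ch.isalpha()]
    # Runs of Unicode letters are treated as words.
    words = re.findall(r"[^\W\d_]+", lowered)
    letter_share = sum(ch in alphabet for ch in letters) / len(letters) if letters else 0.0
    word_share = sum(w in vocabulary for w in words) / len(words) if words else 0.0
    return letter_share, word_share


def majority_vote(labels):
    """Return the label chosen by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]


# Correctly decoded text scores high on both checks.
print(encoding_scores("The text is valid"))
# Text decoded with the wrong code page scores low on both checks.
print(encoding_scores("Òåêñò"))
# Three hypothetical classifier decisions combined by simple majority.
print(majority_vote(["intact", "damaged", "intact"]))
```

Low scores on both checks suggest that the symbol encoding was identified incorrectly. In a fuller pipeline such statistics would be among the values fed into the attribute vector.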
EFFECT: higher productivity of a system for content-based processing of electronic documents and an increase in the number of data sources analysed.
5 dwg
Authors
Dates
2015-05-10—Published
2013-12-11—Filed