FIELD: information technology.
SUBSTANCE: in the method for recognising textual information and evaluating its integrity in Internet electronic documents, an electronic document is split into areas presumed to contain text paragraphs and lines. Splitting continues until the resulting areas contain continuous, logically coherent text of the largest possible size. Redundant and surplus information is deleted. The validity of the symbol encoding is analysed by checking whether letters belong to the alphabet and whether words belong to the vocabulary of the given language. Statistical characteristics of word classes and their forms are calculated. From the obtained values, a working vocabulary attribute vector is generated, converted into a principal components vector using component analysis procedures, and classified using previously trained classifiers. The integrity of the textual information is evaluated by a voting method of decision making.
EFFECT: higher productivity of a system for content-based processing of electronic documents and an increase in the number of data sources that can be analysed.
5 dwg
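The encoding check and the voting step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the tiny `ALPHABET` and `VOCABULARY` sets, the function names, and all thresholds are hypothetical, and simple threshold rules stand in for the principal component transform and the trained classifiers, which the abstract does not specify.

```python
import re

# Hypothetical stand-ins: a real system would load the full alphabet and
# vocabulary of the given language.
ALPHABET = set("abcdefghijklmnopqrstuvwxyz")
VOCABULARY = {"the", "text", "is", "split", "into", "areas", "and", "lines"}


def alphabet_ratio(text, alphabet=ALPHABET):
    """Share of letter characters that belong to the expected alphabet."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in alphabet for ch in letters) / len(letters)


def vocabulary_ratio(text, vocabulary=VOCABULARY):
    """Share of words (runs of letters) found in the vocabulary."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    return sum(word in vocabulary for word in words) / len(words)


def feature_vector(text):
    """Two-element attribute vector; the source uses richer statistics."""
    return [alphabet_ratio(text), vocabulary_ratio(text)]


def majority_vote(votes):
    """Return True when strictly more than half of the votes are True."""
    votes = list(votes)
    return sum(votes) > len(votes) / 2


def is_intact(text, alphabet_threshold=0.9, vocabulary_threshold=0.5):
    """Combine three assumed threshold rules by majority voting."""
    a, v = feature_vector(text)
    votes = [
        a >= alphabet_threshold,    # rule 1: letters fit the alphabet
        v >= vocabulary_threshold,  # rule 2: words fit the vocabulary
        (a + v) / 2 >= 0.7,         # rule 3: combined score (assumed)
    ]
    return majority_vote(votes)
```

Calling `is_intact` on a text area returns a boolean decision. In a fuller version, each rule would be replaced by a classifier trained on the transformed attribute vector, with the vote taken over their outputs.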
Title | Year | Author | Number |
---|---|---|---|
METHOD OF DETERMINING PROFILE OF MOBILE DEVICE USER ON MOBILE DEVICE ITSELF AND DEMOGRAPHIC PROFILING SYSTEM | 2016 | | RU2647661C1 |
METHOD FOR ORDERING DATA SUBMITTED IN ALPHANUMERIC INFORMATION BLOCKS | 2000 | | RU2210809C2 |
METHOD AND SYSTEM FOR CLASSIFYING AND FILTERING PROHIBITED CONTENT IN A NETWORK | 2020 | | RU2738335C1 |
USE OF AUTOENCODERS FOR LEARNING TEXT CLASSIFIERS IN NATURAL LANGUAGE | 2017 | | RU2678716C1 |
METHOD AND SYSTEM FOR DEPERSONALIZATION OF CONFIDENTIAL DATA | 2022 | | RU2804747C1 |
METHOD AND SYSTEM FOR DEPERSONALIZATION OF CONFIDENTIAL DATA | 2022 | | RU2802549C1 |
METHOD AND SYSTEM FOR GENERATING AN OBJECT CARD | 2018 | | RU2739554C1 |
DEVICES AND METHODS, WHICH BUILD THE HIERARCHIALLY ORDINARY DATA STRUCTURE, CONTAINING NONPARAMETERIZED SYMBOLS FOR DOCUMENTS IMAGES CONVERSION TO ELECTRONIC DOCUMENTS | 2013 | | RU2625533C1 |
METHOD FOR AUTOMATIC CLASSIFICATION OF FORMALIZED ELECTRONIC GRAPHIC AND TEXT DOCUMENTS IN THE ELECTRONIC DOCUMENT CIRCULATION SYSTEM WITH AUTOMATIC FORMATION OF ELECTRONIC CASES | 2020 | | RU2759887C1 |
METHOD AND SYSTEM FOR EXTRACTING NAMED ENTITIES | 2021 | | RU2823914C2 |
Authors
Dates
Published: 2015-05-10
Filed: 2013-12-11