FIELD: information technologies.
SUBSTANCE: in the method of automatic classification of formalised documents in an electronic document management system, identical text sections (details) of a formalised document are identified and their characteristics are analysed. The informative part of the document is converted into natural-language text, its words are reduced to base wordforms, insignificant words are deleted, and word weights are computed from their frequency of occurrence, forming text-criteria identification predicates. From a supplied set of manually classified texts, a system of such predicates is generated and saved in a database. The weights of significant wordforms are added to the system of predicates. Where a priori information on dependences between information areas must be used, the algebra of finite predicates is applied, which makes it possible to perform operations on the logical expressions that describe the information areas.
EFFECT: reduced system operating time, achieved by classifying documents by form and identified metadata and by analysing only the informative part of the document.
1 dwg
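The text-processing steps named in the abstract (reduction to base wordforms, removal of insignificant words, frequency-based weights, predicates built from manually classified texts, and logical operations over predicates) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the stop-word list, the crude suffix-stripping `base_form`, the `top_k` and `threshold` parameters, the threshold-on-summed-weights form of the predicate, and the sample texts. Plain Boolean combinators stand in for the predicate algebra.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not enumerate one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "by"}


def base_form(word):
    """Crude suffix stripping as a stand-in for real lemmatisation (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_weights(text):
    """Return relative frequencies of the significant base wordforms in text."""
    tokens = [base_form(w) for w in re.findall(r"[a-z]+", text.lower())]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def build_predicate(texts, top_k=5, threshold=0.2):
    """Build a class predicate from manually classified example texts.

    The top_k wordforms with the largest summed weight are treated as
    significant; a document satisfies the predicate when the weight of
    those wordforms in it reaches the threshold (illustrative rule).
    """
    totals = Counter()
    for text in texts:
        totals.update(word_weights(text))
    significant = {w: s / len(texts) for w, s in totals.most_common(top_k)}

    def predicate(text):
        weights = word_weights(text)
        return sum(weights.get(w, 0.0) for w in significant) >= threshold

    predicate.significant = significant  # kept for inspection or storage
    return predicate


# Boolean combinators over predicates, standing in for the predicate algebra.
def p_and(p, q):
    return lambda text: p(text) and q(text)


def p_not(p):
    return lambda text: not p(text)


# Hypothetical training sets for two information areas.
is_invoice = build_predicate([
    "Invoice for payment of goods. Payment due on delivery.",
    "Invoice total and payment terms for delivered goods.",
])
is_contract = build_predicate([
    "Contract between parties on lease terms.",
    "The parties sign the lease contract.",
])
invoice_only = p_and(is_invoice, p_not(is_contract))

if __name__ == "__main__":
    print(invoice_only("Please process the payment for this invoice"))
```

In this sketch the `significant` mapping attached to each predicate plays the role of the stored wordform weights; a real system would persist it in a database and use a proper morphological analyser instead of suffix stripping.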
Authors
Dates
2015-04-10—Published
2013-12-11—Filed