FIELD: information technology.
SUBSTANCE: a method of classifying documents by category includes constructing an ontology in the form of a set of categories. For each category, terms (sequences of words typical of texts in that category) are identified, and the weight of each identified term is determined, by reading electronic versions of the documents in a training collection. A profile is formed for each category as a list of all terms across all ontology categories, indicating the weight of each term in that category. For each term, a list of its possible word-form combinations is compiled. When the electronic version of a document to be classified is read, the identified terms are selected in it, considering only word forms from the compiled list. For each document to be classified, a profile is formed for each category based on the selected terms. The relevance of the document to each category is determined by comparing the document's profiles with the profiles of the categories in the ontology. A classification spectrum of the document is then constructed as the set of categories together with the relevance found for each of them.
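The steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not say how term weights are computed or how profiles are compared, so relative term frequency and cosine similarity are used here purely as assumptions. All function names (`build_form_index`, `extract_terms`, `category_profiles`, `classification_spectrum`) and the sample data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_form_index(term_forms):
    """Map every listed word-form sequence to its canonical term."""
    index = {}
    for term, forms in term_forms.items():
        for form in forms:
            index[tuple(tokenize(form))] = term
    return index


def extract_terms(text, form_index):
    """Count terms in a text, considering only word forms from the index."""
    tokens = tokenize(text)
    max_len = max((len(form) for form in form_index), default=0)
    counts = Counter()
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n > len(tokens):
                break  # not enough tokens left for an n-word form
            term = form_index.get(tuple(tokens[i:i + n]))
            if term is not None:
                counts[term] += 1
    return counts


def category_profiles(training, form_index, all_terms):
    """Build one profile per category: every term with its weight there.

    Assumption: weight = relative frequency of the term in the category's
    training documents. The abstract does not specify the weighting.
    """
    profiles = {}
    for category, docs in training.items():
        counts = Counter()
        for doc in docs:
            counts.update(extract_terms(doc, form_index))
        total = sum(counts.values()) or 1
        profiles[category] = {t: counts[t] / total for t in all_terms}
    return profiles


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classification_spectrum(text, profiles, form_index):
    """Return {category: relevance} for one document.

    Assumption: the document's profile for a category keeps the counts of
    terms that have non-zero weight in that category, and relevance is the
    cosine similarity between that profile and the category profile.
    """
    counts = extract_terms(text, form_index)
    spectrum = {}
    for category, profile in profiles.items():
        doc_profile = {t: float(counts[t]) for t, w in profile.items() if w > 0}
        spectrum[category] = cosine(doc_profile, profile)
    return spectrum


# Hypothetical two-category ontology with a tiny training collection.
term_forms = {
    "neural network": ["neural network", "neural networks"],
    "interest rate": ["interest rate", "interest rates"],
}
training = {
    "ai": ["Neural networks learn. A neural network adapts."],
    "finance": ["Interest rates rose. The interest rate matters."],
}
form_index = build_form_index(term_forms)
profiles = category_profiles(training, form_index, list(term_forms))
print(classification_spectrum("Deep neural networks beat older models.",
                              profiles, form_index))
```

Restricting matching to a precompiled index of word forms means each document is scanned with plain dictionary lookups and no morphological analysis at classification time, which is one plausible reading of how the claimed speed and memory savings could arise.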
EFFECT: high classification speed and reduced memory consumption.
7 cl
Dates
2013-08-27—Published
2012-01-25—Filed