FIELD: computing technology.
SUBSTANCE: a computer-implemented method for obtaining a vector representation of an electronic document, executed by a processing unit and comprising the stages of: generating a model that assigns m-skip-n-grams to clusters, which involves determining the list of m-skip-n-grams in use, converting each m-skip-n-gram from the list into a vector representation, and clustering the m-skip-n-grams; and processing the text document with the resulting model, which involves counting the occurrences of m-skip-n-grams in the document, determining the clusters of the document from those occurrences, summing the number of occurrences of m-skip-n-grams from each cluster, and forming the vector representation of the document.
EFFECT: the different senses of words in the document can be preserved, because a word can be matched to several clusters.
10 cl, 6 dwg, 1 tbl
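The stages listed under SUBSTANCE can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the abstract does not define how skips are counted, how an m-skip-n-gram is converted into a vector, or which clustering algorithm is used. The sketch assumes that an m-skip-n-gram is n tokens with at most m skipped tokens in total, uses a toy hashed count vector as the vector representation, and uses a plain k-means loop for clustering. All function names and parameters (`skip_ngrams`, `embed`, `kmeans`, `build_model`, `document_vector`, `dim`, `k`) are hypothetical.

```python
from collections import Counter
from itertools import combinations
import zlib


def skip_ngrams(tokens, n=2, m=1):
    """n-token subsequences that skip at most m tokens in total (assumed definition)."""
    grams = []
    for i in range(len(tokens)):
        # The remaining n-1 positions are chosen inside a window of n+m tokens.
        window = range(i + 1, min(i + n + m, len(tokens)))
        for rest in combinations(window, n - 1):
            grams.append((tokens[i],) + tuple(tokens[j] for j in rest))
    return grams


def embed(gram, dim=8):
    """Toy vector representation (assumption): hashed token counts."""
    vec = [0.0] * dim
    for tok in gram:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def dist2(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(vectors, k, iters=10):
    """Plain k-means (assumption), seeded with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def build_model(corpus, k, n=2, m=1):
    """Stage 1: list the m-skip-n-grams, vectorize them, cluster them."""
    grams = sorted({g for doc in corpus for g in skip_ngrams(doc, n, m)})
    assign = kmeans([embed(g) for g in grams], k)
    return dict(zip(grams, assign))  # m-skip-n-gram -> cluster index


def document_vector(tokens, model, k, n=2, m=1):
    """Stage 2: count occurrences and sum them per cluster."""
    counts = Counter(skip_ngrams(tokens, n, m))
    vec = [0] * k
    for gram, count in counts.items():
        if gram in model:  # m-skip-n-grams unseen during training are ignored
            vec[model[gram]] += count
    return vec


corpus = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
    "the loan was approved quickly".split(),
]
model = build_model(corpus, k=3)
print(document_vector(corpus[0], model, k=3))
```

Because a word such as "bank" takes part in several m-skip-n-grams, and those can fall into different clusters, one word can contribute to more than one component of the document vector. The length of the resulting vector equals the number of clusters `k`.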
Title | Year | Author | Number |
---|---|---|---|
METHOD AND SYSTEM FOR OBTAINING VECTOR REPRESENTATION OF ELECTRONIC TEXT DOCUMENT FOR CLASSIFICATION BY CATEGORIES OF CONFIDENTIAL INFORMATION | 2021 | | RU2775358C1 |
METHOD AND SYSTEM FOR DETERMINING RESULT OF TASK EXECUTION IN CROWDSOURCED ENVIRONMENT | 2019 | | RU2744032C2 |
METHOD AND SYSTEM OF SEMANTIC PROCESSING TEXT DOCUMENTS | 2016 | | RU2630427C2 |
METHOD OF CONSTRUCTING SEMANTIC MODEL OF DOCUMENT | 2011 | | RU2487403C1 |
THEMATIC MODELS WITH A PRIORI TONALITY PARAMETERS BASED ON DISTRIBUTED REPRESENTATIONS | 2018 | | RU2719463C1 |
METHOD FOR GENERATING MATHEMATICAL MODELS OF A PATIENT USING ARTIFICIAL INTELLIGENCE TECHNIQUES | 2017 | | RU2720363C2 |
AUTOMATED LEGAL ADVICE SYSTEM CONTROL METHOD | 2019 | | RU2718978C1 |
METHOD OF CLASSIFYING DOCUMENTS BY CATEGORIES | 2012 | | RU2491622C1 |
AUTOMATIC DETERMINATION OF SET OF CATEGORIES FOR DOCUMENT CLASSIFICATION | 2018 | | RU2701995C2 |
AI TRANSACTION ADMINISTRATION SYSTEM | 2020 | | RU2777958C2 |
Dates
2022-06-29—Published
2021-06-01—Filed