FIELD: computing technology.
SUBSTANCE: disclosed is a system for augmenting a training sample for machine learning algorithms, comprising: at least one processor; at least one memory device; an input data processing module configured to receive the text data forming the initial training sample and to normalise the data, whereby the text is divided into sentences and cleared of characters; a data vectorisation module configured to convert the normalised sentences into vector form, wherein, in the course of said conversion, each received sentence is split into minimally significant parts constituting words and punctuation marks, said parts are tokenised, a vector representation is formed for each token, and an averaged vector representation of the normalised sentence is formed; a text data enrichment module containing a set of text data collected from open sources, together with metadata for vectorising said data and constructing a search index; a text index module configured to form a text index based on the vector representations of the text data; and a training sample augmentation module configured to supplement and/or adjust the initial text sample by selecting relevant vector representations of tokens in the text data enrichment module according to a measure of token proximity in the vector space.
EFFECT: ensured selection of text data for augmentation of the training sample based on the characteristics of the text of the input training sample.
22 cl, 3 dwg
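The pipeline outlined in the abstract (normalisation, tokenisation, averaged sentence vectors, an index over enrichment texts, and proximity-based selection) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the function names, the character whitelist used for cleaning, the hash-based stand-in for a real embedding model, the list-based index, and the use of cosine similarity as the proximity measure. For brevity the sketch compares averaged sentence vectors rather than individual token vectors.

```python
import hashlib
import math
import re

DIM = 16  # toy embedding size (assumption; a real system would use a trained model)


def normalise(text):
    """Split text into sentences and strip characters outside a simple whitelist."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    cleaned = [re.sub(r"[^\w\s.,!?-]", "", s).strip() for s in sentences]
    return [s for s in cleaned if s]


def tokenise(sentence):
    """Split a sentence into lower-cased words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def token_vector(token):
    """Deterministic stand-in embedding derived from a hash (hypothetical)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:DIM]]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def sentence_vector(sentence):
    """Average the token vectors of one sentence."""
    vecs = [token_vector(t) for t in tokenise(sentence)]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity, used here as the proximity measure (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_index(enrichment_sentences):
    """Pair each enrichment sentence with its vector (a plain list as the index)."""
    return [(s, sentence_vector(s)) for s in enrichment_sentences]


def augment(training_text, index, top_k=1):
    """Append the top_k closest enrichment sentences for each training sentence."""
    sample = normalise(training_text)
    added = []
    for sentence in sample:
        query = sentence_vector(sentence)
        ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
        for candidate, _ in ranked[:top_k]:
            if candidate not in sample and candidate not in added:
                added.append(candidate)
    return sample + added


if __name__ == "__main__":
    index = build_index(["the cat sat on the mat.", "stock prices fell sharply."])
    print(augment("The cat sat on the mat.", index))
```

A linear scan over the index is enough for a demonstration; a production search index would typically use an approximate nearest-neighbour structure instead.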
Title | Year | Author | Number |
---|---|---|---|
TEXT CLASSIFICATION METHOD AND SYSTEM | 2022 | | RU2818693C2 |
METHOD AND SYSTEM FOR GENERATING TEXT | 2023 | | RU2817524C1 |
METHOD AND SYSTEM FOR DIGITAL ASSISTANT TEXT GENERATION | 2022 | | RU2796208C1 |
METHOD AND SYSTEM FOR PARAPHRASING TEXT | 2023 | | RU2814808C1 |
METHOD AND SYSTEM FOR DEPERSONALIZATION OF CONFIDENTIAL DATA | 2022 | | RU2804747C1 |
METHOD AND SYSTEM FOR DEPERSONALIZATION OF CONFIDENTIAL DATA | 2022 | | RU2802549C1 |
SYSTEM AND METHOD FOR AUTOMATED ASSESSMENT OF INTENTIONS AND EMOTIONS OF USERS OF DIALOGUE SYSTEM | 2020 | | RU2762702C2 |
AUTOMATED LEGAL ADVICE SYSTEM CONTROL METHOD | 2019 | | RU2718978C1 |
METHOD OF CREATING MODEL FOR ANALYSING DIALOGUES BASED ON ARTIFICIAL INTELLIGENCE FOR PROCESSING USER REQUESTS AND SYSTEM USING SUCH MODEL | 2019 | | RU2730449C2 |
METHOD OF TRAINED RECURRENT NEURAL NETWORK DEBUGGING | 2019 | | RU2715024C1 |
Dates
Published: 2021-11-01
Filed: 2020-04-28