FIELD: information processing.
SUBSTANCE: the invention relates to methods and a server for processing a text sequence in a machine processing task. In the method, the server receives: a token dictionary storing a set of tokens from a predefined text corpus, where each token is either a single symbol or a merged set of tokens; a merge table indicating possible merges of token pairs from the set of tokens, where a token resulting from a possible merge is associated with the frequency of occurrence of that token in the predefined text corpus; and a text sequence indicating at least one word. For a word from the text sequence, the server uses the token dictionary to split the word into an initial token sequence representing the individual symbols of the word, and then iteratively merges tokens from the initial token sequence to form a final token sequence for the word. At a current merge iteration, the server uses the merge table to determine a set of possible merges of pairs of neighboring tokens in the current token sequence; excludes at least one merge from that set based on an exclusion probability, thereby forming a reduced set of possible merges that is smaller than the set of possible merges; and forms a new token sequence by performing at least one merge from the reduced set in the current token sequence. The new token sequence is used by the server as the current token sequence at the next merge iteration.
At another merge iteration, following the current merge iteration, if no merges are possible in the current token sequence, the server determines that sequence to be the final token sequence to be used in the machine processing task.
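The iterative merging described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `segment_word`, the parameter `drop_prob`, and the representation of the merge table as a mapping from token pairs to frequencies are hypothetical choices made for illustration. The sketch further assumes that each possible merge is excluded independently with probability `drop_prob`, that the most frequent surviving merge is performed first, and that merging stops if every candidate is excluded in an iteration.

```python
import random


def segment_word(word, merge_table, drop_prob=0.1, rng=None):
    """Split `word` into tokens by stochastic pairwise merging.

    merge_table maps a (left, right) token pair to the corpus frequency
    of the merged token (hypothetical representation).
    """
    rng = rng or random.Random()
    tokens = list(word)  # initial token sequence: individual symbols
    while True:
        # Set of possible merges: neighboring pairs found in the merge table.
        candidates = [
            (i, merge_table[(tokens[i], tokens[i + 1])])
            for i in range(len(tokens) - 1)
            if (tokens[i], tokens[i + 1]) in merge_table
        ]
        if not candidates:
            return tokens  # no possible merges: final token sequence
        # Reduced set: each candidate is excluded with probability drop_prob
        # (assumption: exclusions are independent of one another).
        reduced = [c for c in candidates if rng.random() >= drop_prob]
        if not reduced:
            return tokens  # assumption: stop when every candidate is excluded
        # Perform the most frequent surviving merge, leftmost on ties.
        i, _ = max(reduced, key=lambda c: (c[1], -c[0]))
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]


# Toy merge table (made-up frequencies, for illustration only).
toy_merges = {("l", "o"): 5, ("e", "r"): 4, ("lo", "w"): 3, ("low", "er"): 2}
print(segment_word("lower", toy_merges, drop_prob=0.0))  # all merges applied
print(segment_word("lower", toy_merges, drop_prob=0.3))  # varies between calls
```

With `drop_prob=0.0` the sketch reduces to deterministic merging; with a value between 0 and 1, repeated calls on the same word can return different segmentations, which is the source of the several segmentation options mentioned in the EFFECT line.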
EFFECT: increased efficiency of training data preparation, owing to several segmentation options being obtained for a word.
30 cl, 4 dwg
Title | Year | Author | Number |
---|---|---|---|
METHOD AND SERVER FOR PERFORMING PROBLEM-ORIENTED TRANSLATION | 2021 | | RU2820953C2 |
METHOD AND SERVER FOR PERFORMING CONTEXT-SENSITIVE TRANSLATION | 2021 | | RU2812301C2 |
METHODS AND ELECTRONIC DEVICES FOR PACKAGING REQUESTS INTENDED FOR PROCESSING BY PROCESSING UNIT | 2021 | | RU2810916C2 |
METHOD AND SYSTEM FOR EXTRACTING NAMED ENTITIES | 2021 | | RU2823914C2 |
METHOD AND SERVER FOR TRAINING MACHINE LEARNING ALGORITHM IN TRANSLATION | 2020 | | RU2770569C2 |
METHOD AND SERVER FOR TEACHING A NEURAL NETWORK TO FORM A TEXT OUTPUT SEQUENCE | 2020 | | RU2798362C2 |
METHOD AND SERVER FOR TRAINING MACHINE LEARNING ALGORITHM FOR TRANSLATION | 2020 | | RU2789796C2 |
METHOD AND DEVICE FOR VEHICLE CONTROL | 2021 | | RU2767826C1 |
METHOD AND APPARATUS FOR TRAINING MACHINE LEARNING ALGORITHM (MLA) FOR CREATING CONTENT RECOMMENDATIONS IN A RECOMMENDATION SYSTEM AND A METHOD AND APPARATUS FOR CREATING RECOMMENDED CONTENT USING A MACHINE LEARNING ALGORITHM | 2016 | | RU2731659C2 |
TEXT CLASSIFICATION METHOD AND SYSTEM | 2022 | | RU2818693C2 |
Dates
2022-07-11—Published
2020-04-24—Filed