FIELD: data processing.
SUBSTANCE: the invention relates to a method for marking and verifying text data. At the first stage, a deep learning language model is pre-trained on a prepared data corpus comprising collections of texts covering a wide range of topics. At the second stage, text data relevant to the problem being solved are marked through a program interface by selecting text fragments of arbitrary length and assigning them to user-defined categories; the marked data serve as an additional training sample for the language model. At the third stage, the marked data are preprocessed. At the fourth stage, the language model is further trained on the newly marked data and the marked data are vectorized. At the fifth stage, categories are predicted for a set of unlabelled data using a classifier model coupled with the language model, and metrics reflecting the model's degree of uncertainty for each category are generated. A strategy for selecting objects from the sample is applied: each object is assigned a degree of informativeness based on the metrics, and the most informative objects are selected for expert assessment. The uncertainty metrics are maximum entropy and minimum confidence, together with a "category duplication" metric that reflects the degree of uncertainty when data belonging to one category are assigned to another category; this metric is calculated as the average confidence of the model for categories of one type over data marked with categories of the other type. The second to fifth stages are then repeated until consensus is reached between the expert's assessment and the uncertainty metrics for all objects submitted for assessment and their predicted categories; the expert determines when consensus has been reached.
EFFECT: more accurate marking of text documents.
3 cl, 1 dwg
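
The fifth stage can be illustrated with a minimal sketch. The abstract names the metrics but gives no formulas, so everything here is an assumption: entropy is taken as Shannon entropy of the predicted category distribution, minimum confidence as one minus the top probability, and "category duplication" as the mean probability the classifier gives to one category over objects an expert marked with another. The function names, the sample probabilities, and the top-k selection rule are hypothetical and not taken from the patent.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log) of one predicted category distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def least_confidence(probs: Sequence[float]) -> float:
    """One minus the top probability: large when no category clearly wins."""
    return 1.0 - max(probs)


def category_duplication(
    probs_by_object: Sequence[Sequence[float]],
    labels: Sequence[int],
    source: int,
    target: int,
) -> float:
    """Mean probability of `target` over objects marked as `source`.

    Assumed reading of the abstract's "category duplication" metric.
    Returns 0.0 when no object carries the `source` label.
    """
    rows = [p for p, y in zip(probs_by_object, labels) if y == source]
    if not rows:
        return 0.0
    return sum(p[target] for p in rows) / len(rows)


def select_most_informative(
    probs_by_object: Sequence[Sequence[float]],
    k: int,
    score: Callable[[Sequence[float]], float] = entropy,
) -> List[int]:
    """Indices of the k objects with the highest uncertainty score."""
    ranked = sorted(
        range(len(probs_by_object)),
        key=lambda i: score(probs_by_object[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical classifier outputs for three unlabelled objects, three categories.
unlabelled = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # nearly uniform, most uncertain
    [0.70, 0.20, 0.10],
]
to_expert = select_most_informative(unlabelled, k=2)  # -> [1, 2]

# Hypothetical outputs for objects an expert has already marked.
marked_probs = [[0.6, 0.4, 0.0], [0.5, 0.5, 0.0], [0.1, 0.9, 0.0]]
marked_labels = [0, 0, 1]
overlap_0_to_1 = category_duplication(marked_probs, marked_labels, source=0, target=1)
```

In this sketch the objects returned by `select_most_informative` would go to the expert, and a high `category_duplication` value for a pair of categories would suggest that the two categories are being confused and may need more marked examples.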
Title | Year | Author | Number |
---|---|---|---|
AUTOMATED LEGAL ADVICE SYSTEM CONTROL METHOD | 2019 | | RU2718978C1 |
ALLOCATION OF TIME EXPRESSIONS FOR TEXTS IN NATURAL LANGUAGE | 2014 | | RU2595489C2 |
SYSTEM FOR AUTOMATIC DETERMINATION OF SUBJECT MATTER OF TEXT DOCUMENTS BASED ON EXPLICABLE ARTIFICIAL INTELLIGENCE METHODS | 2023 | | RU2823436C1 |
NAMED ENTITIES FROM THE TEXT AUTOMATIC EXTRACTION | 2014 | | RU2665239C2 |
METHOD FOR ATTRIBUTION OF PARTIALLY STRUCTURED TEXTS FOR FORMATION OF NORMATIVE-REFERENCE INFORMATION | 2020 | | RU2750852C1 |
METHOD FOR CONTROLLING A DIALOGUE AND NATURAL LANGUAGE RECOGNITION SYSTEM IN A PLATFORM OF VIRTUAL ASSISTANTS | 2020 | | RU2759090C1 |
TEXT SEGMENTATION | 2017 | | RU2666277C1 |
TRAINING CLASSIFIERS USED TO EXTRACT INFORMATION FROM NATURAL LANGUAGE TEXTS | 2018 | | RU2691855C1 |
CLASSIFIER TRAINING USED FOR EXTRACTING INFORMATION FROM TEXTS IN NATURAL LANGUAGE | 2018 | | RU2681356C1 |
METHOD FOR OBTAINING LOW-DIMENSIONAL NUMERIC REPRESENTATIONS OF SEQUENCES OF EVENTS | 2020 | | RU2741742C1 |
Dates
2025-01-09—Published
2023-12-26—Filed