(19) RU (11) 2 832 840 (13) C1

(51) IPC: G06F18/00 (2023-01-01), G06F40/10 (2020-01-01)

(21) (22) Application: 2023135140, 2023-12-26

(24) Start date: 2023-12-26

(22) Actual filing date: 2023-12-26

(45) Published: 2025-01-09

(72) Inventor: Pantin Aleksej Ivanovich, Korobejnikov Aleksej Andreevich

(73) Holder: Federalnoe Gosudarstvennoe Avtonomnoe Obrazovatelnoe Uchrezhdenie Vysshego Obrazovaniya Issledovatelskij Tekhnologicheskij Universitet

METHOD OF MARKING AND VERIFYING TEXT DATA

Russian patent published in 2025, IPC G06F18/00, G06F40/10

Abstract RU 2832840 C1

FIELD: data processing.

SUBSTANCE: the invention relates to a method for marking and verifying text data. At the first stage, a deep learning language model is pretrained on a prepared data corpus that includes text collections covering a wide range of topics. At the second stage, text data relevant to the problem being solved are marked through a program interface: text fragments of arbitrary length are selected and assigned to user-defined categories, and the marked data serve as an additional training sample for the language model. At the third stage, the marked data are preprocessed. At the fourth stage, the language model is trained on the newly marked data, and the marked data are vectorized. At the fifth stage, categories are predicted for a set of unlabelled data using a classifier model coupled with the language model. Metrics reflecting the model's degree of uncertainty are generated for each category, an object-selection strategy assigns each object a degree of informativeness based on these metrics, and the most informative objects are selected for expert assessment. The uncertainty metrics are maximum entropy, minimum confidence, and a "category duplication" metric, which reflects the degree of uncertainty when data belonging to one category are assigned to another category. This metric is calculated as the model's average confidence in one type of category on data marked with the other type of category. The second through fifth stages are then repeated until consensus is reached between the expert's assessment and the uncertainty metrics for all objects submitted for assessment and their predicted categories. The expert determines when consensus has been reached.

EFFECT: more accurate marking of a text document.

3 cl, 1 dwg
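
The uncertainty metrics named in the fifth stage of the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the toy probabilities, and the labels are all hypothetical, and the "category duplication" matrix reflects only one plausible reading of the abstract (the mean predicted probability of category j over objects marked as category i). Entropy and least confidence follow their standard active-learning definitions.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def least_confidence(probs: Sequence[float]) -> float:
    """One minus the top predicted probability (higher = more uncertain)."""
    return 1.0 - max(probs)


def select_most_informative(
    prob_rows: Sequence[Sequence[float]],
    k: int,
    metric: Callable[[Sequence[float]], float] = entropy,
) -> List[int]:
    """Return indices of the k objects with the highest uncertainty score."""
    ranked = sorted(
        range(len(prob_rows)),
        key=lambda i: metric(prob_rows[i]),
        reverse=True,
    )
    return ranked[:k]


def category_duplication(
    prob_rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    n_categories: int,
) -> List[List[float]]:
    """Assumed reading of the abstract: matrix[i][j] is the mean predicted
    probability of category j over objects that were marked as category i.
    Large off-diagonal values suggest that categories i and j overlap."""
    sums = [[0.0] * n_categories for _ in range(n_categories)]
    counts = [0] * n_categories
    for probs, label in zip(prob_rows, labels):
        counts[label] += 1
        for j, p in enumerate(probs):
            sums[label][j] += p
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(n_categories)
    ]


# Toy classifier outputs for four objects over three categories (invented).
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.34, 0.33, 0.33],
    [0.10, 0.80, 0.10],
]
print(select_most_informative(probs, 2))                    # -> [2, 1]
print(select_most_informative(probs, 2, least_confidence))  # -> [2, 1]
print(category_duplication(probs, [0, 0, 1, 1], 3))
```

In a loop such as the one the abstract describes, the selected indices would be the objects shown to the expert, and the newly marked objects would be added to the training sample before the next round.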

Similar patents RU2832840C1

Title	Year	Author	Number
AUTOMATED LEGAL ADVICE SYSTEM CONTROL METHOD	2019	Prikhodko Olga Viktorovna Khyurri Ruslan Vladimirovich Prikhodko Olga Viktorovna	RU2718978C1
ALLOCATION OF TIME EXPRESSIONS FOR TEXTS IN NATURAL LANGUAGE	2014	Romanenko Aleksandr Aleksandrovich	RU2595489C2
SYSTEM FOR AUTOMATIC DETERMINATION OF SUBJECT MATTER OF TEXT DOCUMENTS BASED ON EXPLICABLE ARTIFICIAL INTELLIGENCE METHODS	2023	Sochenkov Ilia Vladimirovich Zhebel Vladimir Viktorovich Zubarev Denis Vladimirovich Deviatkin Dmitrii Alekseevich Iadrintsev Vasilii Vladimirovich	RU2823436C1
NAMED ENTITIES FROM THE TEXT AUTOMATIC EXTRACTION	2014	Nekhaj Ilya Vladimirovich	RU2665239C2
METHOD FOR ATTRIBUTION OF PARTIALLY STRUCTURED TEXTS FOR FORMATION OF NORMATIVE-REFERENCE INFORMATION	2020	Fedosin Sergei Alekseevich Plotnikova Natalia Pavlovna Martynov Vladislav Aleksandrovich Ryskin Konstantin Eduardovich Kuznetsov Dmitrii Aleksandrovich Deniskin Aleksandr Vladimirovich Vechkanova Iuliia Sergeevna Fediushkin Nikolai Alekseevich Tsilikov Nikita Sergeevich	RU2750852C1
METHOD FOR CONTROLLING A DIALOGUE AND NATURAL LANGUAGE RECOGNITION SYSTEM IN A PLATFORM OF VIRTUAL ASSISTANTS	2020	Ashmanov Stanislav Igorevich Sukhachev Pavel Sergeevich Zorkij Fedor Kirillovich	RU2759090C1
TEXT SEGMENTATION	2017	Indenbom Evgenij Mikhajlovich Kolotienko Sergej Sergeevich	RU2666277C1
TRAINING CLASSIFIERS USED TO EXTRACT INFORMATION FROM NATURAL LANGUAGE TEXTS	2018	Matskevich Stepan Evgenevich Bulgakov Ilya Aleksandrovich	RU2691855C1
CLASSIFIER TRAINING USED FOR EXTRACTING INFORMATION FROM TEXTS IN NATURAL LANGUAGE	2018	Matskevich Stepan Evgenevich Bulgakov Ilya Aleksandrovich	RU2681356C1
METHOD FOR OBTAINING LOW-DIMENSIONAL NUMERIC REPRESENTATIONS OF SEQUENCES OF EVENTS	2020	Babaev Dmitrij Leonidovich Ovsov Nikita Pavlovich Kireev Ivan Aleksandrovich	RU2741742C1

RU 2 832 840 C1

Authors

Pantin Aleksej Ivanovich

Korobejnikov Aleksej Andreevich

Dates

2025-01-09—Published

2023-12-26—Filed