FIELD: data processing.
SUBSTANCE: the invention relates to machine learning, and more specifically to a method of recognizing the nature of text content. The method comprises the steps of: generating an initial set of text data sources containing content of a predetermined subject matter, wherein each source is assigned at least one content nature label and at least one content subject label; automatically parsing each source in the set to identify the author of the source, links to third-party sources, and the location in which the source author is located and/or in which the source is published, wherein sources not included in the available set of sources are considered third-party sources, and wherein links to third-party sources are the names of third-party sources and URL links to third-party sources; searching for said third-party sources by the identified links, by the identified authors, and based on the identified locations; selecting, from the third-party sources found, those whose subject matter is close to at least one of the content subjects of the initial set of sources; automatically assigning the corresponding content subject labels to the selected sources; forming an additional set of sources from the selected sources; automatically assigning at least one content nature label to each source in the additional set by comparing said source with sources from the set of sources that have the same subject matter as the given source; generating a training set of sources by combining the initial set of sources and the labeled additional set of sources; and training a content nature recognition model on the training set of sources.
EFFECT: high accuracy and speed in obtaining the recognition result.
4 cl
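The labeling steps in the abstract can be illustrated with a minimal sketch. It is not the claimed implementation: the abstract does not say how subject closeness is measured or how sources are compared, so the bag-of-words cosine similarity, the nearest-neighbour label transfer, the `threshold` value, and all names (`Source`, `extract_url_links`, `label_candidate`, `build_training_set`) are assumptions made for illustration. Searching for third-party sources and training the recognition model are left out.

```python
import re
from collections import Counter
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional, Set

# Hypothetical URL pattern; name-based links to sources are not handled here.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


@dataclass
class Source:
    name: str
    text: str
    subjects: Set[str] = field(default_factory=set)  # content subject labels
    nature: Optional[str] = None                     # content nature label


def extract_url_links(text: str) -> List[str]:
    """Return the URL links found in a source's text."""
    return URL_RE.findall(text)


def bag_of_words(text: str) -> Counter:
    """Count lowercase ASCII word tokens (an assumed text representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def label_candidate(candidate: Source, initial: List[Source], threshold: float) -> bool:
    """Assign subject and nature labels; return False if no subject is close enough."""
    cand_bag = bag_of_words(candidate.text)
    scored = [(cosine(cand_bag, bag_of_words(s.text)), s) for s in initial]
    close = [(score, s) for score, s in scored if score >= threshold]
    if not close:
        return False
    # Subject labels are taken from every sufficiently similar initial source.
    candidate.subjects = set().union(*(s.subjects for _, s in close))
    # The nature label is copied from the most similar of those sources (assumption).
    _, nearest = max(close, key=lambda pair: pair[0])
    candidate.nature = nearest.nature
    return True


def build_training_set(initial: List[Source], found: List[Source],
                       threshold: float = 0.3) -> List[Source]:
    """Combine the initial set with the labeled additional set."""
    additional = [c for c in found if label_candidate(c, initial, threshold)]
    return initial + additional
```

A short usage example, with invented sample texts:

```python
initial = [
    Source("a", "the election vote results were announced by the commission",
           {"politics"}, "neutral"),
    Source("b", "shocking election vote fraud scandal outrage",
           {"politics"}, "provocative"),
]
found = [
    Source("c", "outrage over election vote fraud scandal"),
    Source("d", "recipe for apple pie with cinnamon"),
]
training = build_training_set(initial, found, threshold=0.3)
```

Here source `c` would be added to the training set with labels inherited from `b`, while `d` would be discarded because it is not close to any subject of the initial set.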
Title | Year | Author | Number |
---|---|---|---|
METHOD OF RECOGNIZING NATURE OF TEXT CONTENT | 2023 | | RU2827987C1 |
DISTRIBUTED LEARNING MACHINE LEARNING MODELS FOR PERSONALIZATION | 2018 | | RU2702980C1 |
METHOD AND SYSTEM FOR CHECKING MEDIA CONTENT | 2022 | | RU2815896C2 |
RETRIEVAL OF INFORMATION OBJECTS USING A COMBINATION OF CLASSIFIERS ANALYZING LOCAL AND NON-LOCAL SIGNS | 2018 | | RU2686000C1 |
METHOD AND SYSTEM FOR CREATING BRIEF SUMMARY OF DIGITAL CONTENT | 2016 | | RU2637998C1 |
NAMED ENTITIES FROM THE TEXT AUTOMATIC EXTRACTION | 2014 | | RU2665239C2 |
SYSTEM FOR IDENTIFYING REPHRASING USING MACHINE TRANSLATION TECHNOLOGY | 2004 | | RU2368946C2 |
SYSTEM AND METHOD FOR AUGMENTATION OF THE TRAINING SAMPLE FOR MACHINE LEARNING ALGORITHMS | 2020 | | RU2758683C2 |
METHODS AND SERVERS FOR TRAINING MODEL TO DETECT SPEAKER CHANGE | 2024 | | RU2841235C1 |
MULTISTAGE TRAINING OF MACHINE LEARNING MODELS FOR RANKING SEARCH RESULTS | 2021 | | RU2824338C2 |
Dates
2025-04-21—Published
2024-05-15—Filed