FIELD: computer engineering.
SUBSTANCE: the invention relates to collecting and processing text information from web pages. The system comprises: an analyser module that searches the Internet for domain names containing news sources, analyses their HTML code to identify news feeds, extracts links to the text of each news source, and transfers the identified links, their type and their processing algorithm to a database; a scraping module that processes data using a web resource mark-up analysis algorithm; and a parsing module that receives HTML code from the scraping module and extracts text from it using two text data collection algorithms, each of which selects the HTML node with the largest ratio of characters characterizing the connected text of the news source to the total number of characters. The results of both extraction algorithms are then processed by a machine learning model, which checks them for characteristics of non-news sources and detects semantically coherent text that characterizes a news source.
EFFECT: increasing the accuracy of collecting and processing text information from a web page.
5 cl, 8 dwg
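The node-selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how "characters characterizing the connected text" are counted, so this sketch assumes they are letters and single spaces in visible text, and that the "total number" includes markup characters. The names `Node`, `DensityParser`, `extract_main_text` and the `min_text_chars` threshold are hypothetical, introduced only for illustration. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they do not open a subtree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose content is never readable text.
NON_TEXT_TAGS = {"script", "style"}


class Node:
    """One HTML element with counters aggregated over its whole subtree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.text_chars = 0    # assumed "connected text" characters
        self.total_chars = 0   # all characters, markup included
        self.pieces = []       # visible text fragments in document order

    @property
    def ratio(self):
        return self.text_chars / self.total_chars if self.total_chars else 0.0


class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = Node("#document")
        self.nodes = []

    def _bump(self, total, text_chars=0, piece=None):
        # Add the counts to the current node and to every ancestor.
        node = self.current
        while node is not None:
            node.total_chars += total
            node.text_chars += text_chars
            if piece:
                node.pieces.append(piece)
            node = node.parent

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.current = Node(tag, self.current)
            self.nodes.append(self.current)
        self._bump(len(self.get_starttag_text() or ""))

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name.
        node = self.current
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is None:
            return  # stray end tag with no matching open element
        self.current = node
        self._bump(len(tag) + 3)  # length of "</tag>"
        self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in NON_TEXT_TAGS:
            self._bump(len(data))  # counts toward the total only
            return
        piece = " ".join(data.split())  # collapse runs of whitespace
        text_chars = sum(ch.isalpha() or ch == " " for ch in piece)
        self._bump(len(data), text_chars, piece)


def extract_main_text(html, min_text_chars=20):
    """Return the text of the element with the highest text-to-total ratio."""
    parser = DensityParser()
    parser.feed(html)
    parser.close()
    # The threshold keeps tiny nodes (a single word) from winning.
    candidates = [n for n in parser.nodes if n.text_chars >= min_text_chars]
    if not candidates:
        return ""
    best = max(candidates, key=lambda n: n.ratio)
    return " ".join(best.pieces)
```

Note that a pure ratio tends to favour a single long paragraph over the container that holds the whole article, because the container also carries the markup of its children. This may be one reason the abstract combines two extraction algorithms and passes their results to a machine learning model rather than relying on one heuristic.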
Title | Year | Author | Number |
---|---|---|---|
SYSTEM AND METHOD FOR SELECTING RELEVANT PAGE ITEMS WITH IMPLICITLY SPECIFYING COORDINATES FOR IDENTIFYING AND VIEWING RELEVANT INFORMATION | 2015 | | RU2708790C2 |
METHOD AND SYSTEM FOR COMPUTER PROCESSING OF ONE OR MORE QUOTES IN DIGITAL TEXTS FOR DETERMINATION OF THEIR AUTHOR | 2018 | | RU2711123C2 |
METHOD AND SYSTEM FOR GENERATING AN OBJECT CARD | 2018 | | RU2739554C1 |
DEPTH REFERENCES FOR NATIVE APPLICATIONS | 2015 | | RU2668726C2 |
METHOD OF DETERMINING PROFILE OF MOBILE DEVICE USER ON MOBILE DEVICE ITSELF AND DEMOGRAPHIC PROFILING SYSTEM | 2016 | | RU2647661C1 |
SYSTEM AND METHOD FOR GENERATING CLASSIFIER FOR DETECTING PHISHING SITES USING DOM OBJECT HASHES | 2023 | | RU2811375C1 |
DEEP LINKS FOR NATIVE APPLICATIONS | 2015 | | RU2774319C2 |
SYSTEM AND METHOD FOR COLLECTING INFORMATION FOR DETECTING PHISHING | 2016 | | RU2671991C2 |
HYBRID AUTOMATIC SYSTEM FOR CONTROLLING USERS ACCESS TO INFORMATION RESOURCES IN PUBLIC COMPUTER NETWORKS | 2018 | | RU2697925C1 |
METHOD FOR DETECTING PHISHING SITES AND SYSTEM THAT IMPLEMENTS IT | 2023 | | RU2813242C1 |
Dates
2023-05-05—Published
2022-04-29—Filed