FIELD: physics, computer engineering.
SUBSTANCE: the invention relates to systems and methods for building corpora for research and other purposes. The method of building a corpus from Internet forums on a computer system comprises: constructing a document object model (DOM) as a tree data structure; selecting a group of same-type vertices in the DOM tree; removing optional formatting elements from the pages; merging non-leaf vertices that have the same names in the DOM tree and combining leaf vertices that have the same properties; scoring the vertices and filtering the groups; and constructing XPath expressions and applying them to the set of files containing all documents from a selected forum.
EFFECT: high accuracy in separating user text from other content on web pages during automatic construction of a corpus.
10 cl, 3 dwg
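The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the sample pages, the `(tag, class)` path signature used to group same-type leaf vertices, the text-length score, and every function name are assumptions made for illustration. The step that removes optional formatting elements is omitted, and the pages are well-formed markup so that the standard-library `xml.etree.ElementTree` parser and its limited XPath support are sufficient.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical forum pages (well-formed markup, invented for this sketch).
PAGES = [
    "<html><body>"
    "<div class='nav'>Home</div>"
    "<div class='post'><p>First user message, fairly long text here.</p></div>"
    "<div class='post'><p>Second user message with more words in it.</p></div>"
    "<div class='footer'>Contact</div>"
    "</body></html>",
    "<html><body>"
    "<div class='nav'>Home</div>"
    "<div class='post'><p>A reply on another page of the same forum.</p></div>"
    "<div class='footer'>Contact</div>"
    "</body></html>",
]


def leaf_paths(node, path=()):
    """Yield (path, element) for every leaf vertex of the DOM tree.

    A path is a tuple of (tag, class) pairs from the root down to the leaf.
    """
    path = path + ((node.tag, node.get("class")),)
    children = list(node)
    if not children:
        yield path, node
    for child in children:
        yield from leaf_paths(child, path)


def group_leaves(root):
    """Group leaf texts by path, so same-type vertices end up together."""
    groups = defaultdict(list)
    for path, node in leaf_paths(root):
        groups[path].append((node.text or "").strip())
    return groups


def score(texts):
    """Assumed scoring rule: total amount of text carried by the group."""
    return sum(len(t) for t in texts)


def to_xpath(path):
    """Build an ElementTree-compatible XPath, relative to the root element."""
    steps = [f"{tag}[@class='{cls}']" if cls else tag for tag, cls in path[1:]]
    return "./" + "/".join(steps)


# Learn the expression from one page, then apply it to every page of the forum.
groups = group_leaves(ET.fromstring(PAGES[0]))
best_path = max(groups, key=lambda p: score(groups[p]))
xpath = to_xpath(best_path)

corpus = [
    (el.text or "").strip()
    for page in PAGES
    for el in ET.fromstring(page).findall(xpath)
]
```

Here the highest-scoring group is the one holding the post paragraphs, so the derived expression selects user messages and skips the navigation and footer blocks. Real forum HTML is rarely well-formed, so a practical version would need a tolerant HTML parser and a more careful scoring and filtering rule.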
Title | Year | Author | Number |
---|---|---|---|
OPTIMISING EXECUTION OF HD-DVD TIMING MARKUP | 2007 | | RU2460157C2 |
DEVICE AND METHOD FOR PROCESSING CONTENT OF WEB RESOURCE IN BROWSER | 2014 | | RU2595524C2 |
WEBPAGE BROWSING METHOD, WEBAPP FRAMEWORK, METHOD AND DEVICE FOR EXECUTING JAVASCRIPT AND MOBILE TERMINAL | 2013 | | RU2604326C2 |
METHODS AND SYSTEMS FOR PROCESSING DOCUMENT OBJECT MODELS (DOM) TO PROCESS VIDEO CONTENT | 2010 | | RU2475832C1 |
METHOD AND SYSTEM FOR MODIFYING TEXT IN DOCUMENT | 2015 | | RU2610585C2 |
PROGRAMMABILITY FOR XML DATA STORE FOR DOCUMENTS | 2006 | | RU2417420C2 |
METHOD OF ANALYSING TEXT DATA TONALITY | 2014 | | RU2571373C2 |
SYSTEM AND METHOD FOR GENERATING CLASSIFIER FOR DETECTING PHISHING SITES USING DOM OBJECT HASHES | 2023 | | RU2811375C1 |
PROGRAMMING INTERFACE FOR COMPUTER PLATFORM | 2004 | | RU2371758C2 |
METHOD FOR DETECTING PHISHING SITES AND SYSTEM THAT IMPLEMENTS IT | 2023 | | RU2813242C1 |
Dates
2015-10-20—Published
2013-11-01—Filed