TWO-PASS HASH EXTRACTION OF TEXT STRINGS

Russian patent published in 2012 - IPC G06F17/21

Abstract RU 2464630 C2

FIELD: information technology.

SUBSTANCE: in the text recognition method, a plurality of terms used in a text string is first generated, and a plurality of hash values is calculated from the generated terms. For each hash value, a hash bucket may be created in which an associated occurrence count is maintained. The hash buckets may be sorted by occurrence count, and only a few top buckets kept. Once those top buckets are known, a second pass may walk the text string, generate the terms again, and calculate a hash value for each term. If a term's hash value matches the hash value of one of the kept buckets, the term may be considered a frequent term and added to a dictionary along with a corresponding frequency count. Finally, the dictionary may be examined to remove terms that are not actually frequent but were admitted because of hash collisions.
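The two passes described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not say how terms are formed, which hash function is used, how many buckets are kept, or what threshold separates frequent from infrequent terms. Here terms are assumed to be word n-grams, the hash is assumed to be CRC-32 reduced modulo a bucket count, and the names generate_terms, term_hash, frequent_terms, num_buckets, keep and min_count are all hypothetical.

```python
from collections import Counter
import zlib


def generate_terms(text, n=2):
    """Yield every run of n consecutive whitespace-separated words.

    Assumption: the abstract does not define a "term"; word n-grams
    are used here purely for illustration.
    """
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def term_hash(term, num_buckets):
    """Map a term to a bucket index (assumed hash: CRC-32 mod num_buckets)."""
    return zlib.crc32(term.encode("utf-8")) % num_buckets


def frequent_terms(text, n=2, num_buckets=1024, keep=8, min_count=2):
    # Pass 1: keep one occurrence count per hash bucket; no term is stored.
    bucket_counts = Counter(
        term_hash(t, num_buckets) for t in generate_terms(text, n)
    )
    # Sort buckets by count and keep only the top few.
    kept = {bucket for bucket, _ in bucket_counts.most_common(keep)}

    # Pass 2: regenerate the terms and store only those whose hash
    # falls into one of the kept buckets, with a frequency count.
    dictionary = Counter(
        t for t in generate_terms(text, n)
        if term_hash(t, num_buckets) in kept
    )

    # Clean-up: drop terms that are not frequent themselves but entered
    # the dictionary because they collided with a frequent term.
    return {t: c for t, c in dictionary.items() if c >= min_count}
```

Because the first pass stores only per-bucket counters, its memory use is bounded by num_buckets rather than by the number of distinct terms; full term strings are stored only in the second pass, and only for kept buckets. Calling frequent_terms with num_buckets=1 and keep=1 forces every term into the same bucket, which makes the role of the final clean-up step easy to observe.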

EFFECT: reduced amount of memory required for data storage and shorter time for restoring compressed data.

20 cl, 6 dwg

RU 2 464 630 C2

Authors

Pauzin Dominik

Dates

2012-10-20—Published

2008-08-28—Filed