TWO-PASS HASH EXTRACTION OF TEXT STRINGS
Russian patent published in 2012 - IPC G06F17/21

Abstract RU 2464630 C2

FIELD: information technology.

SUBSTANCE: in the method of recognising text, a first pass generates a plurality of terms used in a text string and calculates a hash value for each generated term. For each hash value, a hash bucket may be created in which an associated occurrence count is maintained. The hash buckets may be sorted by occurrence count, and only a few top buckets may be kept. Once those top buckets are known, a second pass may walk the text string again, generate the terms, and calculate a hash value for each term. If the hash value of a term matches the hash value of one of the kept buckets, the term may be considered a frequent term and may be added to a dictionary along with its frequency count. The dictionary may then be examined to remove terms that are not actually frequent but were admitted because of hash collisions.
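
The two passes described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the term definition (all substrings of 3 to 5 characters), the CRC32 hash, the bucket count, and the thresholds `top_buckets` and `min_count` are assumptions chosen only for illustration, and the function names are hypothetical.

```python
from collections import Counter
import zlib


def generate_terms(text, min_len=3, max_len=5):
    """Yield every substring of length min_len..max_len (assumed term definition)."""
    for start in range(len(text)):
        for length in range(min_len, max_len + 1):
            if start + length <= len(text):
                yield text[start:start + length]


def term_hash(term, num_buckets=1024):
    """Map a term to a bucket index; CRC32 is an illustrative choice of hash."""
    return zlib.crc32(term.encode("utf-8")) % num_buckets


def frequent_terms(text, top_buckets=8, min_count=2, num_buckets=1024):
    """Return {term: count} for terms judged frequent by the two-pass scheme."""
    # Pass 1: keep one occurrence count per hash bucket, not per term.
    bucket_counts = Counter(
        term_hash(t, num_buckets) for t in generate_terms(text)
    )
    # Sort buckets by occurrence count and keep only the top few.
    kept = {bucket for bucket, _ in bucket_counts.most_common(top_buckets)}

    # Pass 2: regenerate the terms and count exact terms only when
    # their hash falls into one of the kept buckets.
    dictionary = Counter(
        t for t in generate_terms(text) if term_hash(t, num_buckets) in kept
    )

    # Cleanup: drop terms that are not frequent themselves and only
    # landed in a kept bucket through a hash collision.
    return {t: c for t, c in dictionary.items() if c >= min_count}
```

Calling `frequent_terms("abcabcabc")` returns a dictionary of repeated substrings with their counts. The first pass stores only one integer per bucket, so exact per-term counts are kept only for the terms that survive into the second pass.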

EFFECT: reduced amount of memory required for data storage and shorter time for restoring compressed data.

20 cl, 6 dwg

Similar patents RU2464630C2

Title | Year | Authors | Number
POP-UP VERIFICATION PANEL | 2014 | Voronko Artem Nikolaevich | RU2665274C2
TEXT CLASSIFICATION METHOD AND SYSTEM | 2022 | Konodyuk Nikita Evgenevich; Tikhonova Mariya Ivanovna | RU2818693C2
HASH-BASED ENCODER DECISIONS FOR VIDEO CODING | 2014 | Li, Bin; Xu, Ji-Zheng | RU2679981C2
IDENTIFICATION OF FIELDS AND TABLES IN DOCUMENTS USING NEURAL NETWORKS USING GLOBAL DOCUMENT CONTEXT | 2019 | Stanislav Semenov | RU2723293C1
OPTICAL CHARACTER RECOGNITION BY MEANS OF COMBINATION OF NEURAL NETWORK MODELS | 2020 | Konstantin Anisimovich; Alexey Zhuravlev | RU2768211C1
METHOD AND SYSTEM FOR GENERATION OF ARTICLES IN NATURAL LANGUAGE DICTIONARY | 2014 | Selegej Vladimir Pavlovich; Maramchin Aleksej Sergeevich | RU2639280C2
SYSTEM AND METHOD OF CREATING AND USING USER ONTOLOGY-BASED PATTERNS FOR PROCESSING USER TEXT IN NATURAL LANGUAGE | 2015 | Bulgakov Ilia Aleksandrovich; Yakovlev Egor Nikolaevich; Starostin Anatoly Sergeevich | RU2596599C2
DETECTING SECTIONS OF TABLES IN DOCUMENTS BY NEURAL NETWORKS USING GLOBAL DOCUMENT CONTEXT | 2019 | Stanislav Semenov | RU2721189C1
SYSTEM AND METHOD OF CREATING AND USING USER SEMANTIC DICTIONARIES FOR PROCESSING USER TEXT IN NATURAL LANGUAGE | 2015 | Yakovlev Egor Nikolaevich; Starostin Anatoly Sergeevich | RU2584457C1
DECODER, ENCODER, COMPUTER PROGRAM AND METHOD | 2017 | Szucs, Paul | RU2744169C2

Authors

Pauzin Dominik

Dates

Published: 2012-10-20

Filed: 2008-08-28