FIELD: information technology.
SUBSTANCE: in the method of recognising text, a plurality of terms used in a text string is first generated, and a plurality of hash values is calculated from the generated terms. For each hash value, a hash bucket may be created in which an associated occurrence count is maintained. The hash buckets may be sorted by occurrence count, and only a few top buckets kept. Once those top buckets are known, a second pass may walk the text string again, generate the terms, and calculate a hash value for each term. If a term's hash value matches the hash value of one of the kept buckets, the term may be considered a frequent term and added to a dictionary along with a corresponding frequency count. Finally, the dictionary may be examined to remove terms that are not actually frequent but were admitted because of hash collisions.
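The two-pass scheme described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not specify how terms are formed, which hash function is used, how many buckets are kept, or what threshold separates frequent from infrequent terms. The word n-gram tokenizer, the use of CRC-32, and the names terms, bucket_of, frequent_terms, num_buckets, keep and min_count are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def terms(text, n=2):
    """Yield word n-grams of length 1..n (assumed definition of a 'term')."""
    words = text.split()
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            yield " ".join(words[i:i + size])


def bucket_of(term, num_buckets):
    """Map a term to a hash bucket; CRC-32 is an arbitrary illustrative choice."""
    return zlib.crc32(term.encode("utf-8")) % num_buckets


def frequent_terms(text, num_buckets=1024, keep=8, min_count=2, n=2):
    # Pass 1: keep one occurrence count per hash bucket, not per term,
    # so memory is bounded by num_buckets rather than by vocabulary size.
    bucket_counts = Counter(bucket_of(t, num_buckets) for t in terms(text, n))

    # Sort buckets by occurrence count and keep only the top few.
    kept = {bucket for bucket, _ in bucket_counts.most_common(keep)}

    # Pass 2: walk the text again and count exactly only those terms
    # whose hash value falls into one of the kept buckets.
    dictionary = Counter(
        t for t in terms(text, n) if bucket_of(t, num_buckets) in kept
    )

    # Cleanup: a rare term can share a bucket with a frequent one;
    # drop entries whose exact count is below the (assumed) threshold.
    return {t: c for t, c in dictionary.items() if c >= min_count}
```

The memory saving comes from the first pass, which stores a fixed number of bucket counters however many distinct terms the text contains. Exact per-term counts are built only in the second pass, and only for terms that land in a kept bucket.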
EFFECT: reduced amount of memory required for data storage and shorter time for restoring compressed data.
20 cl, 6 dwg
Dates
2012-10-20—Published
2008-08-28—Filed