FIELD: information technology.
SUBSTANCE: to determine a text signature of a target document whose length is bounded by predetermined lower and upper bounds, a plurality of text tokens is selected by selecting a preliminary set of text tokens, determining the token count of that preliminary set, and, when the count exceeds a predetermined threshold, truncating the preliminary set to form a selected set of tokens that does not exceed the threshold. The size of a signature fragment is determined in accordance with the upper and lower bounds and in accordance with the count of the selected set. A plurality of signature fragments is determined, each according to a hash of an individual token of the selected set. Each fragment consists of a sequence of characters whose length is equal to the fragment size. The fragments are concatenated to form the text signature.
EFFECT: faster computation and lower memory use when determining a text signature, without reducing the accuracy of comparing documents by their signatures.
22 cl, 18 dwg, 3 tbl
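The steps named under SUBSTANCE can be illustrated with a minimal Python sketch. It is not the claimed implementation: the abstract does not specify the tokenizer, the hash function, the character alphabet, or the exact rule that maps the bounds and the token count to a fragment size. Whitespace tokenization, SHAKE-256 hex output, the rule `upper // token_count`, and all function and parameter names below are assumptions made for illustration only.

```python
import hashlib


def select_tokens(text, max_tokens):
    """Select a preliminary token set and truncate it to the threshold.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    preliminary = text.lower().split()
    if len(preliminary) > max_tokens:
        # Truncate so the selected set does not exceed the threshold.
        return preliminary[:max_tokens]
    return preliminary


def fragment_size(token_count, lower, upper):
    """Pick a fragment size from the bounds and the selected token count.

    Assumption: use the largest size whose total length fits under `upper`.
    """
    if token_count < 1 or token_count > upper:
        raise ValueError("token count must be between 1 and the upper bound")
    size = upper // token_count
    if token_count * size < lower:
        raise ValueError("bounds cannot be met with this rule")
    return size


def fragment(token, size):
    """Derive `size` characters from a hash of one token.

    Assumption: SHAKE-256 hex digits serve as the character sequence.
    """
    digest = hashlib.shake_256(token.encode("utf-8")).hexdigest(size)
    return digest[:size]


def text_signature(text, lower=32, upper=64, max_tokens=64):
    """Concatenate one fragment per selected token into a signature."""
    tokens = select_tokens(text, max_tokens)
    size = fragment_size(len(tokens), lower, upper)
    return "".join(fragment(token, size) for token in tokens)


if __name__ == "__main__":
    signature = text_signature("the quick brown fox jumps over the lazy dog")
    print(len(signature), signature)
```

With the default bounds, the nine-token sample sentence gets a fragment size of 7 and a signature of 63 characters. Because every fragment depends on a single token, two texts with the same token count that differ in one word differ only in the corresponding fragment, which is what makes signatures of this kind comparable position by position.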
Title | Year | Author | Number |
---|---|---|---|
SYSTEMS AND METHODS FOR SPAM DETECTION USING CHARACTER HISTOGRAMS | 2012 | | RU2601193C2 |
SYSTEM AND METHODS FOR SPAM DETECTION USING FREQUENCY SPECTRA OF CHARACTER STRINGS | 2012 | | RU2601190C2 |
SYSTEM AND METHODS FOR DETECTING NETWORK FRAUD | 2017 | | RU2744671C2 |
SYSTEMS AND METHODS OF DYNAMIC INDICATORS AGGREGATION TO DETECT NETWORK FRAUD | 2012 | | RU2607229C2 |
METHOD AND SYSTEM FOR CREATING A LIST OF ELECTRONIC MESSAGES | 2014 | | RU2595496C2 |
METHOD OF DETECTING INSIGNIFICANT LEXICAL ITEMS IN TEXT MESSAGES AND COMPUTER | 2014 | | RU2580424C1 |
METHOD AND SYSTEM FOR REFORMATTING ELECTRONIC MESSAGE BASED ON CATEGORY THEREOF | 2014 | | RU2595618C2 |
METHOD AND SYSTEM FOR REFORMATTING ELECTRONIC MESSAGE BASED ON CATEGORY THEREOF | 2014 | | RU2595619C2 |
METHOD AND SYSTEM FOR CREATING LIST OF ELECTRONIC MESSAGES | 2014 | | RU2595617C2 |
TEXT SEGMENTATION METHODS AND SYSTEMS | 2003 | | RU2348071C2 |
Dates
2017-10-04—Published
2014-02-04—Filed