FIELD: physics.
SUBSTANCE: a current image from a series of images of an original document is received, wherein the current image at least partially overlaps the previous image in the series. Optical character recognition (OCR) of the current image is performed to produce recognized text and its corresponding text layout. A corresponding plurality of reference points is determined in each of the current and previous images, each reference point being associated with at least one textual artifact of a plurality of textual artifacts. Using the coordinates of corresponding reference points in the current and previous images, the parameters of a coordinate transformation converting previous-image coordinates to current-image coordinates are calculated. Using this coordinate transformation, at least part of the recognized text is associated with a cluster from a plurality of clusters of symbol sequences. For each cluster, a median string representing that cluster of symbol sequences is determined.
EFFECT: the final recognized text, corresponding to at least part of the original document, is obtained from the median strings.
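The step of choosing one representative per cluster can be illustrated with a minimal sketch. It assumes that the representative is the cluster member with the smallest total Levenshtein (edit) distance to all members of the cluster, often called the set median. The abstract does not specify the distance measure or the selection rule, so both are assumptions, and the names `edit_distance`, `set_median` and the sample cluster are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def set_median(cluster: list[str]) -> str:
    """Return the member with the smallest summed edit distance to the cluster.

    Assumption: the representative is restricted to strings already in the
    cluster; a generalized median string could lie outside the cluster.
    """
    if not cluster:
        raise ValueError("cluster must not be empty")
    return min(cluster, key=lambda s: sum(edit_distance(s, t) for t in cluster))


# Hypothetical OCR readings of the same text line in three overlapping images.
cluster = ["recognized text", "rec0gnized text", "recognized tex"]
print(set_median(cluster))  # prints "recognized text"
```

Restricting the candidates to cluster members keeps the search quadratic in the cluster size. Each of the two erroneous readings differs from the correct one by a single edit but from each other by two, so the correct reading has the lowest total distance.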
20 cl, 12 dwg
Dates
2017-03-21—Published
2016-05-13—Filed