FIELD: data processing.
SUBSTANCE: the group of inventions relates to data processing and can be used to obtain vector representations of data in a table based on the table's structure and content. The method comprises the following steps: obtaining data that includes text and a table structure, where the table is defined as a set consisting of a list of table header cells and a list of table body cells; marking each table body cell with tags characterizing a table identifier, a list of atomic columns to which the cell belongs, and a list of atomic rows to which the cell belongs; supplementing the data of each table body cell with information from the corresponding header cells; tokenizing the text in the table; performing positional encoding at the level of table rows; forming a vector representation for each token in the table by aggregating the token's vector representation and its positional vector representation; generating an attention matrix using the membership of cells in table columns or rows; storing the coordinates of the table cell boundaries in the sequence of table tokens; feeding the prepared text and positional vector representations of tokens, together with the attention matrix, to the base model, which processes them to obtain contextualized vector representations of tokens; and, using the stored coordinates of the table cell boundaries, applying pooling to obtain a vector representation of each table cell.
EFFECT: faster process of training a language model when working with spreadsheet documents.
6 cl, 5 dwg
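
The attention-matrix and pooling steps named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not state the exact attention rule or the pooling type, so the rule "two tokens may attend to each other if their cells share an atomic row or an atomic column" and the use of mean pooling are assumptions, and the function names `build_attention_mask` and `pool_cells` are hypothetical.

```python
import numpy as np

def build_attention_mask(token_rows, token_cols):
    """Boolean mask: token i may attend to token j when their cells share
    at least one atomic row or one atomic column (assumed rule)."""
    n = len(token_rows)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            share_row = bool(token_rows[i] & token_rows[j])
            share_col = bool(token_cols[i] & token_cols[j])
            mask[i, j] = share_row or share_col
    return mask

def pool_cells(token_vectors, cell_spans):
    """Mean-pool token vectors over each cell's [start, end) span
    (mean pooling is an assumption; the abstract only says 'pooling')."""
    return np.stack([token_vectors[s:e].mean(axis=0) for s, e in cell_spans])

# Toy 2x2 table body: the first cell has two tokens, the others one each.
# Each token carries the atomic rows and columns of the cell it came from.
token_rows = [{0}, {0}, {0}, {1}, {1}]
token_cols = [{0}, {0}, {1}, {0}, {1}]
cell_spans = [(0, 2), (2, 3), (3, 4), (4, 5)]  # stored cell boundaries

mask = build_attention_mask(token_rows, token_cols)

# Stand-in for the base model's contextualized token vectors.
token_vectors = np.arange(10, dtype=float).reshape(5, 2)
cell_vectors = pool_cells(token_vectors, cell_spans)
```

In this toy example, tokens from diagonally opposite cells share neither a row nor a column, so they are masked from each other, while each cell ends up with a single pooled vector regardless of how many tokens it contains.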
Title | Year | Author | Number |
---|---|---|---|
TEXT CLASSIFICATION METHOD AND SYSTEM | 2022 | | RU2818693C2 |
METHOD AND DEVICE FOR DETERMINING FRAUDULENT TRANSACTIONS OF USER | 2024 | | RU2839053C1 |
ADJUSTABLE TABLE STYLES FOR SPREADSHEETS | 2006 | | RU2419851C2 |
METHOD AND SYSTEM FOR DETECTING OBFUSCATED MALICIOUS COMMANDS IN SYSTEM CONSOLE OF OPERATING SYSTEM | 2024 | | RU2838483C1 |
METHOD AND SYSTEM FOR TRAINING CHATBOT SYSTEM | 2023 | | RU2820264C1 |
METHOD AND SYSTEM FOR OBTAINING VECTOR REPRESENTATION OF ELECTRONIC TEXT DOCUMENT FOR CLASSIFICATION BY CATEGORIES OF CONFIDENTIAL INFORMATION | 2021 | | RU2775358C1 |
METHOD AND DEVICE FOR GENERATING VIDEO CLIP FROM TEXT DESCRIPTION AND SEQUENCE OF KEY POINTS SYNTHESIZED BY DIFFUSION MODEL | 2024 | | RU2823216C1 |
EXTRACTING INFORMATION FROM STRUCTURED DOCUMENTS CONTAINING TEXT IN NATURAL LANGUAGE | 2015 | | RU2607976C1 |
SYSTEM AND METHOD FOR TRAINING MACHINE LEARNING MODELS FOR RANKING SEARCH RESULTS | 2023 | | RU2829065C1 |
METHOD FOR PREDICTION OF DIAGNOSIS BASED ON DATA PROCESSING CONTAINING MEDICAL KNOWLEDGE | 2019 | | RU2723674C1 |
Authors
Dates
2025-04-25—Published
2024-08-28—Filed