Main idea - pipeline flow
Pipes are components for writing expressive code when working on collections. Pipes chain together into a pipeline that receives data from a source, operates on that data, and sends the results to a destination.
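The pipeline idea can be sketched in a few lines of standalone C++. This is only an illustration of chaining stages with operator|; the names Data, Stage, keep_containing, to_upper and run_demo are hypothetical and are not part of doctotext.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; none of these are doctotext types.
struct Data { std::vector<std::string> items; };
using Stage = std::function<Data(Data)>;

// data | stage: hand the data to the next stage and return what it produces.
Data operator|(Data data, const Stage& stage) { return stage(std::move(data)); }

// data | stream: the destination; writes one item per line.
std::ostream& operator|(const Data& data, std::ostream& out)
{
  for (const auto& item : data.items) out << item << '\n';
  return out;
}

// A stage that keeps only the items containing `phrase`.
Stage keep_containing(std::string phrase)
{
  return [phrase](Data data) {
    auto& v = data.items;
    v.erase(std::remove_if(v.begin(), v.end(),
              [&](const std::string& s) { return s.find(phrase) == std::string::npos; }),
            v.end());
    return data;
  };
}

// A stage that upper-cases every item.
Stage to_upper()
{
  return [](Data data) {
    for (auto& s : data.items)
      std::transform(s.begin(), s.end(), s.begin(),
                     [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return data;
  };
}

// Runs source | filter | transform | destination and returns the text written.
std::string run_demo()
{
  std::ostringstream out;
  Data{{"hello world", "goodbye", "hello pipes"}} | keep_containing("hello") | to_upper() | out;
  return out.str();
}
```

Each stage only knows its input and output, so stages can be reordered or replaced without touching the source or the destination.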
Main elements - short description
Importer - Imports and extracts all data from input streams. The importer contains a parser object that parses elements such as text, styles, and images.
Transformer - Receives data from an importer or another transformer and can transform it. For example, a transformer can filter emails that contain a specific phrase, translate text into another language, or sum the values in table columns.
Exporter - Exports the parsed data from an importer or transformer to an output stream. Data can be exported as plain text or as HTML. We can also write our own exporter that uses the data sent by the importer and transformers (see Parsing process - control). A custom importer can be written in a similar way.
Parsing process - control
During parsing, the parser sends us signals carrying a doctotext::Info structure. A signal is emitted whenever the parser encounters a new node. A node is an abstract element of the file, represented by a tag and the tag's attributes (doctotext::StandardTag). A node can be, for example, a page, a paragraph, or a link. A node can contain other nodes; for example, an email node includes its attachments. All node data is kept in the Info structure. Info also lets us control the parsing process by setting flags. The doctotext::Info structure contains two flags:
- skip - skips the next node
- cancel - cancels the whole parsing process
Using these flags we can stop the process (e.g. on a timeout) or choose which part of a file to parse (e.g. the first 10 pages of a PDF file or the last 10 mails from a mailbox).
Example of using "cancel" and "skip": example_4.cpp
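The effect of the two flags can be illustrated with a small standalone sketch. The Info struct below is a simplified, hypothetical stand-in for doctotext::Info, and parse_pages is an invented loop that only imitates how a parser might react to the flags; the sketch assumes that skip drops the node the callback was invoked for.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for doctotext::Info; only the tag name
// and the two control flags are modelled.
struct Info
{
  std::string tag_name;
  bool skip = false;
  bool cancel = false;
};

// Imitates a parser walking over pages: the callback is invoked for every
// node, and the flags it sets decide what happens next.
// Returns the 1-based numbers of the pages that were actually processed.
template <typename Callback>
std::vector<int> parse_pages(int page_count, Callback on_new_node)
{
  std::vector<int> processed;
  for (int page = 1; page <= page_count; ++page)
  {
    Info info;
    info.tag_name = "page";
    on_new_node(info, page);
    if (info.cancel) break;   // stop the whole parsing process
    if (info.skip) continue;  // drop this node and keep parsing
    processed.push_back(page);
  }
  return processed;
}

// Example policy: ignore page 2 and stop once page 4 is reached.
std::vector<int> demo()
{
  return parse_pages(10, [](Info& info, int page) {
    if (page == 2) info.skip = true;
    if (page == 4) info.cancel = true;
  });
}
```

With this policy only pages 1 and 3 are processed: page 2 is skipped, and parsing is cancelled when page 4 is reached.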
The tag name tells us which part of the document was parsed. It can be a piece of text (tag text or tag paragraph), a table, a list, a text style, and so on. The full list of available tags, with descriptions, is in doctotext::StandardTag. Some tags have attributes, which are stored in doctotext::Info::attributes. To get an attribute from this map you need the attribute's name and type; both are described in doctotext::StandardTag.
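Looking up an attribute by name and type could work roughly as in the sketch below. It is an assumption-laden illustration: the Info struct here is a hypothetical stand-in built on std::any and std::optional (C++17), and the real doctotext::Info may store its attributes differently. Only the call shape getAttributeValue<T>(name) is taken from the examples in this document.

```cpp
#include <any>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for doctotext::Info, for illustration only.
struct Info
{
  std::string tag_name;
  std::map<std::string, std::any> attributes;

  // Returns the attribute if it exists and holds a T, otherwise std::nullopt.
  template <typename T>
  std::optional<T> getAttributeValue(const std::string& name) const
  {
    auto it = attributes.find(name);
    if (it == attributes.end()) return std::nullopt;
    if (const T* value = std::any_cast<T>(&it->second)) return *value;
    return std::nullopt;
  }
};

// Builds a sample mail node with a string and an unsigned integer attribute.
Info make_mail()
{
  Info info;
  info.tag_name = "mail";
  info.attributes["subject"] = std::string("Hello there");
  info.attributes["date"] = 1651437232u;
  return info;
}
```

In this sketch a lookup with the wrong type, such as getAttributeValue<int>("date") on an unsigned value, yields an empty optional, which is why both the name and the type matter.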
Important! Control of the parsing process (skip and cancel) is currently supported only for "pst", "ost", "tiff", "jpeg", "bmp", "png", "pnm", "jfif", "jpg", "webp". This list will be extended gradually.
Parser parameters
doctotext::ParserParameters provides a mechanism for passing additional information to a parser. For example, you can choose the processing language for the OCR parser. The parameters common to all parsers are listed below.
Parameters for parsers
| Parameter name | Parameter type | Parameter description |
| log_stream | std::ostream* | Output for parser logs. The default log stream is std::cerr. |
| verbose_logging | bool | Flag indicating whether the logging mechanism should be enabled. |
Parser tags
Each parser emits tags during parsing. This is required because tags are the only way to get parsed data from a parser. Parsers that do not fully support our API emit a single tag, doctotext::StandardTag::TAG_TEXT, containing all the text parsed from the document. For the remaining parsers, the emitted tags are listed below:
Tags for parsers
| Parser name | Supported formats | Available tags |
| ODFOOXMLParser | "odt", "ods", "odp", "odg", "docx", "xlsx", "pptx", "ppsx" | |
| ODFXMLParser | "fodt", "fods", "fodp", "fodg" | |
| PSTParser | "pst", "ost" | doctotext::StandardTag::TAG_TAB, doctotext::StandardTag::TAG_FOLDER_HEADER, doctotext::StandardTag::TAG_MAIL_HEADER, doctotext::StandardTag::TAG_ATTACHMENT_HEADER, doctotext::StandardTag::TAG_ATTACHMENT_BODY, doctotext::StandardTag::TAG_ATTACHMENT_CLOSE_BODY, doctotext::StandardTag::TAG_MAIL_BODY, doctotext::StandardTag::TAG_MAIL_CLOSE_BODY |
| PdfParser | "pdf" | |
| HtmlParser | "html", "htm" | |
| OcrParser | "tiff", "jpeg", "bmp", "png", "pnm", "jfif", "jpg", "webp" | |
| EMLParser | "pst", "ost" | |
| DOCParse | "doc" | |
| XLSParser | "xls" | |
| XLSBParser | "xlsb" | |
| PPTParser | "ppt", "pps" | |
| IWorkParser | "pages", "key", "numbers" | |
| RTFParser | "rtf" | |
| TXTParser | "txt", "text" | |
Importer and Exporter
Basic example of usage in C++:
#include <iostream>
#include <memory>
#include "importer.h"
#include "exporter.h"
#include "parsing_chain.h"
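int main(int argc, char* argv[])
{
  if (argc > 1)
  {
    // Import the file named on the command line, export it as plain text
    // and write the result to standard output.
    doctotext::Importer(argv[1])
      | doctotext::PlainTextExporter()
      | std::cout;
  }
  return 0;
}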
The Importer class. This class is used to import a file and parse it using available parsers.
Exporter class for plain text output.
Basic example of usage in C:
#include "stdio.h"
int main(int argc, char *argv[])
{
const char *file_name = argv[1];
return 0;
}
File contains c api for doctotext software.
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_exporter(DocToTextImporter *importer, DocToTextExporter *exporter)
Creates connection between importer and exporter and returns DocToTextParsingChain which contains all...
struct DocToTextParserManager DocToTextParserManager
DllExport DocToTextParserManager *DOCTOTEXT_CALL doctotext_init_parser_manager(const char *path_to_plugins)
Creates new parser manager with all available parsers.
DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_plain_text_exporter(FILE *output_stream)
Creates a new DocToTextExporter object. This object is used to export parsed data to output as a plai...
DllExport void DOCTOTEXT_CALL doctotext_free_parsing_chain(DocToTextParsingChain *parsing_chain)
Frees parsing_chain and all resources allocated by the parsing chain. Remember not to use function fr...
DllExport DocToTextImporter *DOCTOTEXT_CALL doctotext_create_importer_from_file_name(DocToTextParserManager *manager, const char *file_name)
Creates a new DocToTextImporter object. This object is used to import a file and parse it using avail...
struct DocToTextExporter DocToTextExporter
struct DocToTextParsingChain DocToTextParsingChain
DllExport void DOCTOTEXT_CALL doctotext_free_exporter(DocToTextExporter *exporter)
Frees exporter and all resources allocated by the exporter. Remember not to use function free()....
DllExport void DOCTOTEXT_CALL doctotext_free_importer(DocToTextImporter *importer)
Frees importer and all resources allocated by the importer. DocToTextImporter is allocated using oper...
struct DocToTextImporter DocToTextImporter
We can also define a second exporter and write the output as HTML to the file output.html.
Example for C++:
#include <iostream>
#include <fstream>
#include <memory>
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
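int main(int argc, char* argv[])
{
  // First pipeline: export the input file as plain text to output.txt.
  std::ifstream(argv[1], std::ios_base::in|std::ios_base::binary)
    | doctotext::Importer()
    | doctotext::PlainTextExporter()
    | std::ofstream("output.txt");
  // Second pipeline: export the same file as HTML to output.html.
  std::ifstream(argv[1], std::ios_base::in|std::ios_base::binary)
    | doctotext::Importer()
    | doctotext::HtmlExporter()
    | std::ofstream("output.html");
  return 0;
}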
Exporter class for HTML output.
Example for C:
#include "stdio.h"
int main(int argc, char *argv[])
{
FILE *html_file = fopen("output.html", "w");
FILE *plain_text_file = fopen("output.txt", "w");
const char *file_name = argv[1];
return 0;
}
DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_html_exporter(FILE *output_stream)
Creates a new DocToTextExporter object. This object is used to export parsed data to output as a html...
When parsing multiple files, we can reuse the same importer and exporter objects for every file. First we create the parsing chain by connecting the importer and the exporter; then we start parsing by passing each file in turn to the importer.
Example for C++:
#include <iostream>
#include <fstream>
#include <memory>
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
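int main(int argc, char* argv[])
{
  // Build the chain once; it is reused for every input file.
  auto chain = doctotext::Importer()
    | doctotext::PlainTextExporter()
    | std::cout;
  for (int i = 1; i < argc; ++i)
  {
    std::cout << "Parsing file " << argv[i] << std::endl;
    // Piping an input stream into the chain starts parsing.
    std::ifstream(argv[i], std::ios_base::in|std::ios_base::binary) | chain;
    std::cout << std::endl;
  }
  return 0;
}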
Example for C:
#include "stdio.h"
int main(int argc, char *argv[])
{
const char *file_name = argv[1];
for (int i = 1; i < argc; i++)
{
FILE *file = fopen(argv[i], "r");
fclose(file);
}
return 0;
}
DllExport void DOCTOTEXT_CALL doctotext_parsing_chain_set_input(DocToTextParsingChain *parsing_chain, FILE *input_stream)
Adds input stream to the parsing chain. This function starts parsing chain.
Transformer
A transformer is an object that we can connect between the importer and the exporter. It receives data from the importer or from another transformer and can transform it.
For example, a transformer can filter emails that contain a specific phrase. A transformer can also skip the data from the current callback or stop the parsing process. The example below uses a transformer to filter out mails whose subject contains "Hello".
Example for C++:
#include <iostream>
#include <memory>
#include "parser.h"
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
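int main(int argc, char* argv[])
{
  doctotext::Importer(argv[1])
    | doctotext::TransformerFunc([](doctotext::Info &info)
      {
        if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // the current node is a mail
        {
          auto subject = info.getAttributeValue<std::string>("subject");
          if (subject) // the subject attribute exists
          {
            if (subject->find("Hello") != std::string::npos)
            {
              info.skip = true; // skip this mail
            }
          }
        }
      })
    | doctotext::PlainTextExporter()
    | std::cout;
  return 0;
}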
static const std::string TAG_MAIL
Tag for mail. Attributes: "subject": std::string, "date": uint (unix timestamp).
Wraps single function (doctotext::NewNodeCallback) into Transformer object.
std::string tag_name
tag name
Example for C:
#include "stdio.h"
#include <string.h>
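/* Transformer callback: skips every mail whose subject contains "Hello". */
void filter_by_subject(DocToTextInfo *info, void *data)
{
  const char *tag_name = doctotext_info_get_tag_name(info);
  if (strcmp(tag_name, "mail-header") == 0)
  {
    const char *subject = doctotext_info_get_string_attribute(info, "subject");
    if (subject && strstr(subject, "Hello") != 0)
    {
      doctotext_info_set_skip(info, true);
    }
  }
}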
int main(int argc, char *argv[])
{
const char *file_name = argv[1];
return 0;
}
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_exporter(DocToTextParsingChain *parsing_chain, DocToTextExporter *exporter)
Adds exporter to the parsing chain.
struct DocToTextInfo DocToTextInfo
DllExport void DOCTOTEXT_CALL doctotext_info_set_skip(DocToTextInfo *info, bool skip)
Sets skip flag in DocToTextInfo. If skip is true then current node will be skipped....
DllExport void DOCTOTEXT_CALL doctotext_free_transformer(DocToTextTransformer *transformer)
Frees transformer and all resources allocated by the transformer. Remember not to use function free()...
DllExport DocToTextTransformer *DOCTOTEXT_CALL doctotext_create_transfomer(void(*callback)(DocToTextInfo *, void *data), void *data)
Creates a new DocToTextTransformer object. This object is used to transform parsed data....
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_tag_name(DocToTextInfo *info)
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_string_attribute(DocToTextInfo *info, const char *attribute_name)
Returns attribute value as a string from DocToTextInfo.
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_transformer(DocToTextImporter *importer, DocToTextTransformer *transformer)
Creates connection between importer and transformer and returns DocToTextParsingChain which contains ...
Transformers can be chained to build complex transformations and filters. For example, we can combine a transformer that filters out mails whose subject contains "Hello" with one that limits the number of mails to 10.
Example for C++:
#include <iostream>
#include <memory>
#include "parser.h"
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
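int main(int argc, char* argv[])
{
  int counter = 0;          // mails seen so far
  const int max_mails = 10; // stop after this many mails
  doctotext::Importer(argv[1])
    | doctotext::TransformerFunc([](doctotext::Info &info)
      {
        if (info.tag_name == doctotext::StandardTag::TAG_MAIL)
        {
          auto subject = info.getAttributeValue<std::string>("subject");
          if (subject)
          {
            if (subject->find("Hello") != std::string::npos)
            {
              info.skip = true; // skip this mail
            }
          }
        }
      })
    | doctotext::TransformerFunc([&counter, max_mails](doctotext::Info &info)
      {
        if (info.tag_name == doctotext::StandardTag::TAG_MAIL)
        {
          if (++counter > max_mails)
          {
            info.cancel = true; // stop the whole parsing process
          }
        }
      })
    | doctotext::PlainTextExporter()
    | std::cout;
  return 0;
}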
Example for C:
#include "stdio.h"
#include <string.h>
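/* First transformer callback: skips every mail whose subject contains "Hello". */
void filter_by_subject(DocToTextInfo *info, void *data)
{
  const char *tag_name = doctotext_info_get_tag_name(info);
  if (strcmp(tag_name, "mail-header") == 0)
  {
    const char *subject = doctotext_info_get_string_attribute(info, "subject");
    if (subject && strstr(subject, "Hello") != 0)
    {
      doctotext_info_set_skip(info, true);
    }
  }
}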
struct callbackData
{
int mail_counter;
int max_mails_number;
};
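/* Second transformer callback: cancels parsing once the mail limit is reached. */
void limit_mails(DocToTextInfo *info, void *data)
{
  struct callbackData* p_callback_data = (struct callbackData*)(data);
  const char *tag_name = doctotext_info_get_tag_name(info);
  if (strcmp(tag_name, "mail-header") == 0)
  {
    if (p_callback_data->mail_counter >= p_callback_data->max_mails_number)
    {
      doctotext_info_set_cancel_parser(info, true);
    }
    else
    {
      p_callback_data->mail_counter++;
    }
  }
}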
int main(int argc, char *argv[])
{
const char *file_name = argv[1];
struct callbackData callback_data;
callback_data.mail_counter = 0;
callback_data.max_mails_number = 10;
return 0;
}
DllExport void DOCTOTEXT_CALL doctotext_info_set_cancel_parser(DocToTextInfo *info, bool cancel)
Sets cancel flag in DocToTextInfo. If cancel is true then parsing chain will be stop....
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_transformer(DocToTextParsingChain *parsing_chain, DocToTextTransformer *transformer)
Adds transformer to the parsing chain.
Callbacks Api
Another way to parse documents is to use the callback API. We create a specific parser object and connect callback functions to it. With the callback API we have to write out the parsed text ourselves. Below is a basic example of the callback API:
example for C++:
#include <algorithm>
#include <iostream>
#include <memory>
#include "parser.h"
#include "parser_builder.h"
#include "plain_text_writer.h"
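int main(int argc, char* argv[])
{
  auto parser_manager = std::make_shared<doctotext::ParserManager>(); // loads the available parsers
  std::string path = argv[1];
  auto parser_builder = parser_manager->findParserByExtension(path); // empty if no parser is found
  auto plain_text_writer = std::make_shared<doctotext::PlainTextWriter>();
  plain_text_writer->write_header(std::cout);
  if (parser_builder)
  {
    (*parser_builder)->build(path)
      .addOnNewNodeCallback([&plain_text_writer](doctotext::Info &info)
      {
        // Write every parsed node to standard output as plain text.
        plain_text_writer->write_to(info,
          std::cout);
      })
      .parse();
  }
  plain_text_writer->write_footer(std::cout);
  return 0;
}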
Parser manager class. Loads all available parsers and provides access to them.
std::optional< ParserBuilder * > findParserByExtension(const std::string &file_name) const
Returns parser builder for given extension type or nullopt if no parser is found.
example for C:
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
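struct callbackData
{
  DocToTextWriter *writer; /* writer used to print each parsed node */
};
/* Callback: writes the current node to standard output. */
void write_node(DocToTextInfo *info, void *data)
{
  struct callbackData* p_callback_data = (struct callbackData*)(data);
  doctotext_writer_write(p_callback_data->writer, info, stdout);
}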
int main(int argc, char *argv[])
{
struct callbackData callback_data;
callback_data.writer = writer;
return 0;
}
struct DocToTextParser DocToTextParser
DllExport DocToTextWriter *DOCTOTEXT_CALL doctotext_create_html_writer()
Creates HtmlWriter. HtmlWriter writes parsed date from callbacks as html. Example of usage:
DllExport void DOCTOTEXT_CALL doctotext_writer_write_header(DocToTextWriter *writer, FILE *out_stream)
Returns beginning of text from callbacks.
DllExport void DOCTOTEXT_CALL doctotext_parser_add_callback_on_new_node(DocToTextParser *parser, void(*callback)(DocToTextInfo *, void *data), void *data)
Adds new function to execute when new node will be parsed. Node is a part of hierarchical structure....
DllExport void DOCTOTEXT_CALL doctotext_writer_write(DocToTextWriter *writer, DocToTextInfo *info, FILE *out_stream)
Converts text from callback to html format.
DllExport DocToTextParser *DOCTOTEXT_CALL doctotext_parser_manager_get_parser_by_extension(DocToTextParserManager *parser_manager, const char *format)
Returns proper parser for given format. The format is defined by file extension. Example of usage:
DllExport void DOCTOTEXT_CALL doctotext_free_parser(DocToTextParser *parser)
Frees parser. Remember not to use function free(). DocToTextParser is allocated using operator new (f...
DllExport void DOCTOTEXT_CALL doctotext_writer_write_footer(DocToTextWriter *writer, FILE *out_stream)
Returns end of text from callbacks.
struct DocToTextWriter DocToTextWriter
DllExport void DOCTOTEXT_CALL doctotext_parser_parse(DocToTextParser *parser)
Start parsing loaded data. The data comes from file or from buffer.
In the callback API we can add many callback functions to a parser; they work much like transformers in the stream API. So we can add a callback that filters by mail subject or by mail count, just as in the stream API.
example for C++:
#include <algorithm>
#include <iostream>
#include <memory>
#include "parser.h"
#include "parser_builder.h"
#include "plain_text_writer.h"
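int main(int argc, char* argv[])
{
  auto parser_manager = std::make_shared<doctotext::ParserManager>();
  std::string path = argv[1];
  auto parser_builder = parser_manager->findParserByExtension(path); // empty if no parser is found
  doctotext::PlainTextWriter plain_text_writer;
  plain_text_writer.write_header(std::cout);
  if (parser_builder)
  {
    (*parser_builder)->withParserManager(parser_manager)
      .build(path)
      .addOnNewNodeCallback([](doctotext::Info &info)
      {
        // First callback: skip mails whose subject contains "Hello".
        if (info.tag_name == doctotext::StandardTag::TAG_MAIL)
        {
          auto subject = info.getAttributeValue<std::string>(
            "subject");
          if (subject)
          {
            if (subject->find("Hello") != std::string::npos)
            {
              info.skip = true;
            }
          }
        }
      })
      .addOnNewNodeCallback([&plain_text_writer](doctotext::Info &info)
      {
        // Second callback: write the node to standard output.
        plain_text_writer.
          write_to(info, std::cout);
      })
      .parse();
  }
  plain_text_writer.write_footer(std::cout);
  return 0;
}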
void write_footer(std::ostream &stream) const override
Write footer for plain text format.
void write_to(const doctotext::Info &info, std::ostream &stream) const override
Converts text from callback to plain text format.
example for C:
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
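struct callbackData
{
  DocToTextWriter *writer; /* writer used to print each parsed node */
};
/* Callback: writes the current node to standard output. */
void write_node(DocToTextInfo *info, void *data)
{
  struct callbackData* p_callback_data = (struct callbackData*)(data);
  doctotext_writer_write(p_callback_data->writer, info, stdout);
}
/* Callback: skips every mail whose subject contains "Hello". */
void filter_by_subject(DocToTextInfo *info, void *data)
{
  const char *tag_name = doctotext_info_get_tag_name(info);
  if (strcmp(tag_name, "mail-header") == 0)
  {
    const char *subject = doctotext_info_get_string_attribute(info, "subject");
    if (subject && strstr(subject, "Hello") != 0)
    {
      doctotext_info_set_skip(info, true);
    }
  }
}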
int main(int argc, char *argv[])
{
struct callbackData callback_data;
callback_data.writer = writer;
return 0;
}
SimpleExtractor
The easiest way to parse a document is to use doctotext::SimpleExtractor, which provides the basic functionality for parsing a document.
example for C++:
#include <iostream>
#include "simple_extractor.h"
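int main(int argc, char* argv[])
{
  doctotext::SimpleExtractor simple_extractor(argv[1]); // extractor for the given file
  std::cout << simple_extractor.getPlainText();         // parse and print the plain text
  return 0;
}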
Example for C:
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
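int main(int argc, char *argv[])
{
  /* Create an extractor for the file named on the command line. */
  DocToTextSimpleExtractor *extractor = doctotext_create_simple_extractor(argv[1]);
  if (extractor)
  {
    const char *text = doctotext_simple_extractor_get_plain_text(extractor);
    printf("%s", text);
  }
  return 0;
}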
DllExport const char *DOCTOTEXT_CALL doctotext_simple_extractor_get_plain_text(DocToTextSimpleExtractor *extractor)
Gets parsed plain text from a DocToTextSimpleExtractor object.
struct DocToTextSimpleExtractor DocToTextSimpleExtractor
DllExport DocToTextSimpleExtractor *DOCTOTEXT_CALL doctotext_create_simple_extractor(const char *file_name)
Creates a new DocToTextSimpleExtractor object. Example:
SimpleExtractor also supports custom callback functions, so we can define our own transform or filter functions and use them in the parsing process.
example for C++:
#include <iostream>
#include "parser.h"
#include "simple_extractor.h"
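int main(int argc, char* argv[])
{
  doctotext::SimpleExtractor simple_extractor(argv[1]);
  simple_extractor.addCallbackFunction([](doctotext::Info &info)
  {
    if (info.tag_name == doctotext::StandardTag::TAG_MAIL)
    {
      auto date = info.getAttributeValue<int>("date");
      if (date)
      {
        if (*date < 1651437232) // mail is older than the given unix timestamp
        {
          info.skip = true;
        }
      }
    }
  });
  std::cout << simple_extractor.getPlainText();
}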
Example for C:
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
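/* Callback: skips every mail older than the given unix timestamp. */
void filter_by_date(DocToTextInfo *info, void *data)
{
  const char *tag_name = doctotext_info_get_tag_name(info);
  if (strcmp(tag_name, "mail-header") == 0)
  {
    unsigned int date = doctotext_info_get_uint_attribute(info, "date");
    if (date < 1651437232)
    {
      doctotext_info_set_skip(info, true);
    }
  }
}
int main(int argc, char *argv[])
{
  DocToTextSimpleExtractor *extractor = doctotext_create_simple_extractor(argv[1]);
  if (extractor)
  {
    /* Register the filter before extracting the text. */
    doctotext_simple_extractor_add_callback_function(extractor, &filter_by_date, NULL);
    const char *text = doctotext_simple_extractor_get_plain_text(extractor);
    printf("%s", text);
  }
  return 0;
}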
DllExport void DOCTOTEXT_CALL doctotext_simple_extractor_add_callback_function(DocToTextSimpleExtractor *extractor, void(*callback)(DocToTextInfo *, void *data), void *data)
Adds a callback function to be called during parsing. Example of usage:
DllExport unsigned int DOCTOTEXT_CALL doctotext_info_get_uint_attribute(DocToTextInfo *info, const char *attribute_name)
Returns attribute value as a unsigned integer from DocToTextInfo.
Logs mechanism
Doctotext parsers produce many log messages about the current processing status, warnings, and errors. By default all logs are sent to the standard error stream, but they can be redirected or turned off. To redirect the log stream or to turn logging on or off, we can use doctotext::ParserParameters (see Parser parameters) as below:
std::ofstream* my_log_stream = new std::ofstream("output_logs.txt");
Stores list of parsers parameters. Every parser can query ParserParameter for a specific parameter....
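How the two logging parameters interact can be sketched without the library. The LogSettings struct and log_message function below are hypothetical and do not reproduce the doctotext::ParserParameters interface; they only mirror the log_stream and verbose_logging parameters described in the Parser parameters section, and the sketch assumes logging is enabled by default.

```cpp
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical holder for the two common parameters; not a doctotext type.
struct LogSettings
{
  std::ostream* log_stream = &std::cerr; // default destination
  bool verbose_logging = true;           // assumed default: logging enabled
};

// Writes a message to the configured stream unless logging is turned off.
void log_message(const LogSettings& settings, const std::string& message)
{
  if (settings.verbose_logging && settings.log_stream)
    *settings.log_stream << message << '\n';
}

// Redirects logs into a string buffer, then turns logging off.
std::string demo()
{
  std::ostringstream buffer;
  LogSettings settings;
  settings.log_stream = &buffer;
  log_message(settings, "parsing started");
  settings.verbose_logging = false;
  log_message(settings, "this line is dropped");
  return buffer.str();
}
```

Passing a pointer to the stream, as the log_stream parameter does, leaves ownership with the caller, so the stream must outlive the parsing process.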