DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted data extraction software development toolkit that converts many kinds of files to plain text and HTML. Written in C++, it includes a parser for PST and OST files and a new API for better file processing. DocToText can be integrated with other data mining and data analytics applications. It comes with a high-grade, scriptable and trainable OCR engine that uses LSTM neural network based character recognition. The document parser can extract metadata and annotations, and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
API

Main idea - pipeline flow

Pipes are components for writing expressive code when working on collections. Pipes chain together into a pipeline that receives data from a source, operates on that data, and sends the results to a destination.
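
The pipeline idea can be illustrated without the library. The sketch below is a self-contained toy, not DocToText code: Source, Stage, Pipeline and run_pipeline_demo are hypothetical names chosen for illustration. It only shows how operator| can chain a source, processing steps and a destination.

```cpp
#include <cassert>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not DocToText types.
struct Source { std::vector<std::string> items; };             // where data comes from
using Stage = std::function<std::string(const std::string&)>;  // one processing step

// A pipeline under construction: the source plus the stages attached so far.
struct Pipeline {
    Source source;
    std::vector<Stage> stages;
};

// source | stage starts a pipeline.
Pipeline operator|(Source src, Stage stage) {
    return Pipeline{std::move(src), {std::move(stage)}};
}

// pipeline | stage appends another step.
Pipeline operator|(Pipeline p, Stage stage) {
    p.stages.push_back(std::move(stage));
    return p;
}

// pipeline | stream is the destination: every item runs through all stages.
std::ostream& operator|(const Pipeline& p, std::ostream& out) {
    for (std::string item : p.source.items) {
        for (const Stage& stage : p.stages) item = stage(item);
        out << item << '\n';
    }
    return out;
}

// Builds a two-stage pipeline and returns what reached the destination.
std::string run_pipeline_demo() {
    Stage upper = [](const std::string& s) {
        std::string r = s;
        for (char& c : r) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
        return r;
    };
    Stage exclaim = [](const std::string& s) { return s + "!"; };
    std::ostringstream out;
    Source{{"hello", "world"}} | upper | exclaim | out;
    return out.str();
}
```

Calling run_pipeline_demo() sends two items through both stages in order. In this toy nothing runs until a destination is attached; in the library's own examples, the step that starts parsing is marked in the code comments.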

Main elements - short description

Importer - Imports and extracts all data from input streams. The importer contains a parser object that parses elements such as text, styles and images.
Transformer - Receives data from an importer or another transformer and can transform it. For example, a transformer can filter emails that contain a specific phrase, translate text to another language, or sum values from table columns.
Exporter - Exports the parsed data from an importer or transformer to an output stream. Data can be exported as plain text or as HTML. We can also write our own exporter using the data sent by the importer and transformer (see Parsing process - control). In a similar way, it is possible to write a custom importer or exporter.

Parsing process - control

During parsing, the parser sends us signals carrying a doctotext::Info structure. A signal is emitted each time the parser encounters a new node. A node is an abstract element of the file, represented by a tag and the tag's attributes (doctotext::StandardTag); it can be, for example, a page, a paragraph or a link. A node can contain other nodes, e.g. an email node includes attachments. All node data is kept in the Info structure. Info also lets us control the parsing process by setting flags. The doctotext::Info structure contains two flags:

  • skip - skips next node
  • cancel - cancels all parsing process

Using these flags we can stop the process (e.g. on a timeout) or choose which part of a file to parse (e.g. the first 10 pages of a PDF file or the last 10 mails from a mailbox).
Example of using "cancel" and "skip": example_4.cpp
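
The effect of the two flags can be sketched with a toy node walker. This is not the library's implementation: Info here is a hypothetical stand-in that carries only a tag name and the two flags, and walk_nodes and first_pages_without_links are illustrative names. The sketch assumes that skip drops the node the callback was called for and that cancel stops the whole walk.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for doctotext::Info: only the fields named in the text.
struct Info {
    std::string tag_name;
    bool skip = false;    // set by a callback to drop this node (assumed semantics)
    bool cancel = false;  // set by a callback to stop the whole run
};

using Callback = std::function<void(Info&)>;

// Toy "parser": walks a flat list of node names, asks the callback about each
// one, and returns the names that were actually processed.
std::vector<std::string> walk_nodes(const std::vector<std::string>& nodes,
                                    const Callback& on_new_node) {
    std::vector<std::string> processed;
    for (const std::string& name : nodes) {
        Info info{name};
        on_new_node(info);
        if (info.cancel) break;   // stop everything
        if (info.skip) continue;  // drop only this node
        processed.push_back(name);
    }
    return processed;
}

// Keeps the first `limit` pages and ignores link nodes.
std::vector<std::string> first_pages_without_links(
        const std::vector<std::string>& nodes, int limit) {
    int pages_seen = 0;
    return walk_nodes(nodes, [&](Info& info) {
        if (info.tag_name == "link") info.skip = true;
        if (info.tag_name == "page" && ++pages_seen > limit) info.cancel = true;
    });
}
```

With a limit of 2, the walk ends as soon as the third page node appears, so nothing after it is processed; link nodes are dropped individually without stopping the walk.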

The tag name tells us which part of the document was parsed. It could be a piece of text (tag text or tag paragraph), a table, a list, a text style and so on. The full list of available tags, with descriptions, is in doctotext::StandardTag. Some tags have attributes, which are stored in doctotext::Info::attributes. To get an attribute from this map you need its name and its type; both are described in doctotext::StandardTag.
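
Why both the name and the type are needed can be shown with a small stand-alone sketch. It assumes the attribute map holds values of mixed types in std::any, which this page does not state; Attributes, get_attribute and sample_mail_attributes are hypothetical names, and the sample values only mirror the "subject" and "date" attributes described for the mail tag.

```cpp
#include <any>
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for an attribute map: name -> value of any type.
using Attributes = std::map<std::string, std::any>;

// Returns the attribute if it exists and holds exactly type T, otherwise nullopt.
template <typename T>
std::optional<T> get_attribute(const Attributes& attributes, const std::string& name) {
    auto it = attributes.find(name);
    if (it == attributes.end()) return std::nullopt;  // no such attribute
    if (const T* value = std::any_cast<T>(&it->second)) return *value;
    return std::nullopt;  // present, but stored with a different type
}

// Sample data shaped like the mail attributes described in the text.
Attributes sample_mail_attributes() {
    Attributes a;
    a["subject"] = std::string("Hello from the archive");
    a["date"] = 1651437232u;  // unsigned int, unix timestamp
    return a;
}
```

In this sketch, asking for "date" as int yields an empty optional because the value was stored as unsigned int, which is why a caller should check the result before dereferencing it.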

Important! Control of the parsing process (skip and cancel) is currently supported only for "pst", "ost", "tiff", "jpeg", "bmp", "png", "pnm", "jfif", "jpg", "webp". This list will be extended gradually.

Parser parameters

doctotext::ParserParameters provides a mechanism for passing additional information to a parser. For example, you can choose the processing language for the OCR parser. The parameters common to all parsers are listed below.

Parameters for parsers
Parameter name    Parameter type    Parameter description
log_stream        std::ostream*     Output for parser logs. The default log stream is std::cerr.
verbose_logging   bool              Flag indicating whether the log mechanism should be enabled.

Parser tags

Each parser sends tags during the parsing process. This is required, because it is the only way to get parsed data from a parser. Parsers that don't fully support the API send a single tag, doctotext::StandardTag::TAG_TEXT, with all the text parsed from the document. For the remaining parsers, the emitted tags are listed below:

Tags for parsers
Parser name Supported formats Available tags
ODFOOXMLParser "odt", "ods", "odp", "odg", "docx", "xlsx", "pptx", "ppsx"
ODFXMLParser "fodt", "fods", "fodp", "fodg"
PSTParser "pst", "ost"
  • doctotext::StandardTag::TAG_TAB
  • doctotext::StandardTag::TAG_FOLDER_HEADER
  • doctotext::StandardTag::TAG_MAIL_HEADER
  • doctotext::StandardTag::TAG_ATTACHMENT_HEADER
  • doctotext::StandardTag::TAG_ATTACHMENT_BODY
  • doctotext::StandardTag::TAG_ATTACHMENT_CLOSE_BODY
  • doctotext::StandardTag::TAG_MAIL_BODY
  • doctotext::StandardTag::TAG_MAIL_CLOSE_BODY
PdfParser "pdf"
HtmlParser "html", "htm"
OcrParser "tiff", "jpeg", "bmp", "png", "pnm", "jfif", "jpg", "webp"
EMLParser "eml"
DOCParser "doc"
XLSParser "xls"
XLSBParser "xlsb"
PPTParser "ppt", "pps"
IWorkParser "pages", "key", "numbers"
RTFParser "rtf"
TXTParser "txt", "text"

Importer and Exporter

Basic example of usage in C++:

#include <iostream>
#include <memory>
#include "importer.h"
#include "exporter.h"
#include "parsing_chain.h"
int main(int argc, char* argv[])
{
if (argc > 1)
{
doctotext::Importer(argv[1])
| doctotext::PlainTextExporter()
| std::cout; // parse file and print to standard output
}
return 0;
}
The Importer class. This class is used to import a file and parse it using available parsers.
Definition: importer.h:57
Exporter class for plain text output.
Definition: exporter.h:137

Basic example of usage in C:

#include "doctotext_c_api.h"
#include "stdio.h"
int main(int argc, char *argv[])
{
DocToTextParserManager *manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers from the plugins directory)
const char *file_name = argv[1];
DocToTextImporter *importer = doctotext_create_importer_from_file_name(manager, file_name); // create an importer and set the input file
DocToTextExporter *exporter = doctotext_create_plain_text_exporter(stdout); // create an exporter to plain text and set the output stream
DocToTextParsingChain *chain = doctotext_connect_importer_to_exporter(importer, exporter); // create a parsing chain by connecting importer and exporter (This step starts the parsing chain)
doctotext_free_importer(importer); // free importer
doctotext_free_exporter(exporter); // free exporter
doctotext_free_parsing_chain(chain); // free parsing chain
return 0;
}
File contains c api for doctotext software.
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_exporter(DocToTextImporter *importer, DocToTextExporter *exporter)
Creates connection between importer and exporter and returns DocToTextParsingChain which contains all...
struct DocToTextParserManager DocToTextParserManager
DllExport DocToTextParserManager *DOCTOTEXT_CALL doctotext_init_parser_manager(const char *path_to_plugins)
Creates new parser manager with all available parsers.
DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_plain_text_exporter(FILE *output_stream)
Creates a new DocToTextExporter object. This object is used to export parsed data to output as a plai...
DllExport void DOCTOTEXT_CALL doctotext_free_parsing_chain(DocToTextParsingChain *parsing_chain)
Frees parsing_chain and all resources allocated by the parsing chain. Remember not to use function fr...
DllExport DocToTextImporter *DOCTOTEXT_CALL doctotext_create_importer_from_file_name(DocToTextParserManager *manager, const char *file_name)
Creates a new DocToTextImporter object. This object is used to import a file and parse it using avail...
struct DocToTextExporter DocToTextExporter
struct DocToTextParsingChain DocToTextParsingChain
DllExport void DOCTOTEXT_CALL doctotext_free_exporter(DocToTextExporter *exporter)
Frees exporter and all resources allocated by the exporter. Remember not to use function free()....
DllExport void DOCTOTEXT_CALL doctotext_free_importer(DocToTextImporter *importer)
Frees importer and all resources allocated by the importer. DocToTextImporter is allocated using oper...
struct DocToTextImporter DocToTextImporter

We can also define a second exporter and export output as html to output.html file.

Example for C++:

#include <iostream>
#include <fstream>
#include <memory>
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
int main(int argc, char* argv[])
{
// parse file and print to output.txt file
std::ifstream(argv[1], std::ios_base::in|std::ios_base::binary)
| doctotext::Importer()
| doctotext::PlainTextExporter()
| std::ofstream("output.txt");
// parse file and print to output.html file
std::ifstream(argv[1], std::ios_base::in|std::ios_base::binary)
| doctotext::Importer()
| doctotext::HtmlExporter()
| std::ofstream("output.html");
return 0;
}
Exporter class for HTML output.
Definition: exporter.h:124

Example for C:

#include "doctotext_c_api.h"
#include "stdio.h"
int main(int argc, char *argv[])
{
FILE *html_file = fopen("output.html", "w"); // create a file to export html
FILE *plain_text_file = fopen("output.txt", "w"); // create a file to export plain text
DocToTextParserManager *manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
const char *file_name = argv[1];
DocToTextImporter *importer = doctotext_create_importer_from_file_name(manager, file_name); // create an importer and set the input file
DocToTextExporter *plain_text_exporter = doctotext_create_plain_text_exporter(plain_text_file); // create an exporter to plain text and set the output stream
DocToTextExporter *html_exporter = doctotext_create_html_exporter(html_file); // create an exporter to html and set the output stream
DocToTextParsingChain *chain_1 = doctotext_connect_importer_to_exporter(importer, plain_text_exporter); // create a parsing chain by connecting importer and plain text exporter (This step starts the parsing chain)
DocToTextParsingChain *chain_2 = doctotext_connect_importer_to_exporter(importer, html_exporter); // create a second parsing chain by connecting importer and html exporter (This step starts the parsing chain)
doctotext_free_importer(importer); // free importer
doctotext_free_exporter(plain_text_exporter); // free plain text exporter
doctotext_free_exporter(html_exporter); // free html exporter
doctotext_free_parsing_chain(chain_1); // free parsing chain
doctotext_free_parsing_chain(chain_2); // free parsing chain
return 0;
}
DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_html_exporter(FILE *output_stream)
Creates a new DocToTextExporter object. This object is used to export parsed data to output as a html...

When parsing multiple files, we can reuse the same importer and exporter objects for each file. First we create the parsing chain by connecting the importer and the exporter; then we start parsing by passing each file in turn to the importer.

Example for C++:

#include <iostream>
#include <fstream>
#include <memory>
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
int main(int argc, char* argv[])
{
auto chain = doctotext::Importer()
| doctotext::PlainTextExporter()
| std::cout; // create a chain of steps to parse a file
for (int i = 1; i < argc; ++i)
{
std::cout << "Parsing file " << argv[i] << std::endl;
std::ifstream(argv[i], std::ios_base::in|std::ios_base::binary) | chain; // set the input file as an input stream
std::cout << std::endl;
}
return 0;
}

Example for C:

#include "doctotext_c_api.h"
#include "stdio.h"
int main(int argc, char *argv[])
{
DocToTextParserManager *manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
const char *file_name = argv[1];
DocToTextExporter *exporter = doctotext_create_plain_text_exporter(stdout); // create an exporter to plain text and set the output stream
DocToTextParsingChain *parsing_chain = doctotext_connect_importer_to_exporter(importer, exporter); // create a parsing chain by connecting importer and exporter
// (This step doesn't start the parsing chain because the input is not set)
for (int i = 1; i < argc; i++) // iterate over all files
{
FILE *file = fopen(argv[i], "rb"); // open the file in binary mode
doctotext_parsing_chain_set_input(parsing_chain, file); // set the input file (This step starts the parsing chain for the current file)
fclose(file); // close the file
}
doctotext_free_importer(importer); // free importer
doctotext_free_exporter(exporter); // free exporter
doctotext_free_parsing_chain(parsing_chain); // free parsing chain
return 0;
}
DllExport void DOCTOTEXT_CALL doctotext_parsing_chain_set_input(DocToTextParsingChain *parsing_chain, FILE *input_stream)
Adds input stream to the parsing chain. This function starts parsing chain.

Transformer

Transformer is an object that we can connect to the importer and exporter. The transformer receives data from the importer or another transformer and can transform it.

For example, we can use a transformer to filter emails that contain a specific phrase. A transformer can also skip the data from the current callback or stop the parsing process. The example below shows how to use a transformer to filter out mails whose subject contains "Hello".

Example for C++:

#include <iostream>
#include <memory>
#include "parser.h"
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
int main(int argc, char* argv[])
{
doctotext::Importer(argv[1]) | doctotext::TransformerFunc([](doctotext::Info &info) // Create an importer from file name and connect it to transformer
{
if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if current node is mail
{
auto subject = info.getAttributeValue<std::string>("subject"); // get the subject attribute
if (subject) // if subject attribute exists
{
if (subject->find("Hello") != std::string::npos) // if subject contains "Hello"
{
info.skip = true; // skip the current node
}
}
}
})
| doctotext::PlainTextExporter() // sets exporter to plain text
| std::cout; // sets output stream
return 0;
}
static const std::string TAG_MAIL
Tag for mail. Attributes: "subject": std::string, "date": uint (unix timestamp).
Definition: parser.h:82
Wraps single function (doctotext::NewNodeCallback) into Transformer object.
Definition: transformer.h:87
std::string tag_name
tag name
Definition: parser.h:100

Example for C:

#include "doctotext_c_api.h"
#include "stdio.h"
#include <string.h>
void filterMailsBySubject(DocToTextInfo* info, void* data) // callback function to filter by subject text
{
const char * tag_name = doctotext_info_get_tag_name(info); // get the tag name of current node
if (strcmp(tag_name, "mail-header") == 0) // if current node is mail header
{
const char *subject = doctotext_info_get_string_attribute(info, "subject"); // get the subject attribute
if (strstr(subject, "Hello") != 0) // if subject contains "Hello"
{
doctotext_info_set_skip(info, true); // skip the current node
}
}
}
int main(int argc, char *argv[])
{
DocToTextParserManager *manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
const char *file_name = argv[1];
DocToTextImporter *importer = doctotext_create_importer_from_file_name(manager, file_name); // create an importer from file name
DocToTextExporter *exporter = doctotext_create_plain_text_exporter(stdout); // create an exporter to plain text and set the output stream
DocToTextTransformer *transformer = doctotext_create_transfomer(filterMailsBySubject, NULL); // create a transformer and set the callback function
DocToTextParsingChain *chain = doctotext_connect_importer_to_transformer(importer, transformer); // create a parsing chain by connecting importer and transformer
chain = doctotext_connect_parsing_chain_to_exporter(chain, exporter); // connect the parsing chain to exporter (This step starts the parsing)
doctotext_free_importer(importer); // free importer
doctotext_free_transformer(transformer); // free transformer
doctotext_free_exporter(exporter); // free exporter
doctotext_free_parsing_chain(chain); // free parsing chain
return 0;
}
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_exporter(DocToTextParsingChain *parsing_chain, DocToTextExporter *exporter)
Adds exporter to the parsing chain.
struct DocToTextInfo DocToTextInfo
DllExport void DOCTOTEXT_CALL doctotext_info_set_skip(DocToTextInfo *info, bool skip)
Sets skip flag in DocToTextInfo. If skip is true then current node will be skipped....
DllExport void DOCTOTEXT_CALL doctotext_free_transformer(DocToTextTransformer *transformer)
Frees transformer and all resources allocated by the transformer. Remember not to use function free()...
DllExport DocToTextTransformer *DOCTOTEXT_CALL doctotext_create_transfomer(void(*callback)(DocToTextInfo *, void *data), void *data)
Creates a new DocToTextTransformer object. This object is used to transform parsed data....
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_tag_name(DocToTextInfo *info)
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_string_attribute(DocToTextInfo *info, const char *attribute_name)
Returns attribute value as a string from DocToTextInfo.
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_transformer(DocToTextImporter *importer, DocToTextTransformer *transformer)
Creates connection between importer and transformer and returns DocToTextParsingChain which contains ...

Transformers can be joined together to create complex transformations and filters. For example, we can chain a transformer that filters out mails whose subject contains "Hello" with one that limits the number of mails to 10.

Example for C++:

#include <iostream>
#include <memory>
#include "parser.h"
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"
int main(int argc, char* argv[])
{
doctotext::Importer(argv[1]) | doctotext::TransformerFunc([](doctotext::Info &info) // Create an importer from file name and connect it to transformer
{
if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if current node is mail
{
auto subject = info.getAttributeValue<std::string>("subject"); // get the subject attribute
if (subject) // if subject attribute exists
{
if (subject->find("Hello") != std::string::npos) // if subject contains "Hello"
{
info.skip = true; // skip the current node
}
}
}
})
| doctotext::TransformerFunc([counter = 0, max_mails = 10](doctotext::Info &info) mutable // Create a transformer and connect it to previous transformer
{
if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if current node is mail
{
if (++counter > max_mails) // if counter is greater than max_mails
{
info.cancel = true; // cancel the parsing process
}
}
})
| doctotext::PlainTextExporter() // sets exporter to plain text
| std::cout; // sets output stream
return 0;
}

Example for C:

#include "doctotext_c_api.h"
#include "stdio.h"
#include <string.h>
void filterMailsBySubject(DocToTextInfo* info, void* data) // callback function to filter by subject text
{
const char * tag_name = doctotext_info_get_tag_name(info); // get the tag name of current node
if (strcmp(tag_name, "mail-header") == 0) // if current node is mail header
{
const char *subject = doctotext_info_get_string_attribute(info, "subject"); // get the subject attribute
if (strstr(subject, "Hello") != 0) // if subject contains "Hello"
{
doctotext_info_set_skip(info, true); // skip the current node
}
}
}
struct callbackData
{
int mail_counter;
int max_mails_number;
};
void mailsLimitation(DocToTextInfo* info, void* data)
{
struct callbackData* p_callback_data = (struct callbackData*)(data);
const char * tag_name = doctotext_info_get_tag_name(info); // get the tag name of current node
if (strcmp(tag_name, "mail-header") == 0)
{
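if (p_callback_data->mail_counter >= p_callback_data->max_mails_number) // if the limit has been reached
{
doctotext_info_set_cancel_parser(info, true); // cancel the parsing process
}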
else
{
p_callback_data->mail_counter++;
}
}
}
int main(int argc, char *argv[])
{
DocToTextParserManager *manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
const char *file_name = argv[1];
DocToTextImporter *importer = doctotext_create_importer_from_file_name(manager, file_name); // create an importer from file name
DocToTextExporter *exporter = doctotext_create_plain_text_exporter(stdout); // create an exporter to plain text and set the output stream
DocToTextTransformer *transformer = doctotext_create_transfomer(filterMailsBySubject, NULL); // create a transformer and set the callback function
DocToTextParsingChain *chain = doctotext_connect_importer_to_transformer(importer, transformer); // create a parsing chain by connecting importer and transformer
struct callbackData callback_data; // create a callback data structure
callback_data.mail_counter = 0; // initialize the mail counter
callback_data.max_mails_number = 10; // set the maximum number of mails to 10
DocToTextTransformer *transformer2 = doctotext_create_transfomer(mailsLimitation, &callback_data); // create a transformer and set the callback function and callback data
chain = doctotext_connect_parsing_chain_to_transformer(chain, transformer2); // connect the parsing chain to the transformer
chain = doctotext_connect_parsing_chain_to_exporter(chain, exporter); // connect the parsing chain to exporter (This step starts the parsing)
doctotext_free_importer(importer); // free importer
doctotext_free_transformer(transformer); // free first transformer
doctotext_free_transformer(transformer2); // free second transformer
doctotext_free_exporter(exporter); // free exporter
doctotext_free_parsing_chain(chain); // free parsing chain
return 0;
}
DllExport void DOCTOTEXT_CALL doctotext_info_set_cancel_parser(DocToTextInfo *info, bool cancel)
Sets cancel flag in DocToTextInfo. If cancel is true then parsing chain will be stop....
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_transformer(DocToTextParsingChain *parsing_chain, DocToTextTransformer *transformer)
Adds transformer to the parsing chain.

Callbacks Api

Another approach to parsing documents is the callbacks API. We create a specific parser object and connect callback functions to it. With the callbacks API, we have to write out the parsed text ourselves. Below is a basic example of the callbacks API:

Example for C++:

#include <algorithm>
#include <iostream>
#include <memory>
#include "parser.h"
#include "parser_builder.h"
#include "plain_text_writer.h"
int main(int argc, char* argv[])
{
doctotext::ParserManager parser_manager; // Create parser manager (load parsers)
std::string path = argv[1];
auto parser_builder = parser_manager.findParserByExtension(path); // get the parser builder by extension
auto plain_text_writer = std::make_shared<doctotext::PlainTextWriter>(); // create a plain text writer
plain_text_writer->write_header(std::cout); // write the header to the output stream
if (parser_builder) // if parser builder exists
{
(*parser_builder)->build(path) // build the parser
->addOnNewNodeCallback([&plain_text_writer](doctotext::Info &info) // add a callback function
{
plain_text_writer->write_to(info,
std::cout); // write the node to the output stream
})
.parse(); // start the parsing process
}
plain_text_writer->write_footer(std::cout); // write the footer to the output stream
return 0;
}
Parser manager class. Loads all available parsers and provides access to them.
std::optional< ParserBuilder * > findParserByExtension(const std::string &file_name) const
Returns parser builder for given extension type or nullopt if no parser is found.

Example for C:

#include "doctotext_c_api.h"
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
struct callbackData
{
DocToTextWriter *writer;
};
void onNewNodeCallback(DocToTextInfo* info, void* data) // callback function for new node
{
struct callbackData* p_callback_data = (struct callbackData*)(data); // get callback data
doctotext_writer_write(p_callback_data->writer, info, stdout); // write the parsed text to the output stream
}
int main(int argc, char *argv[])
{
DocToTextParserManager *parser_manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
DocToTextParser* parser = doctotext_parser_manager_get_parser_by_extension(parser_manager, argv[1]); // get parser from item
struct callbackData callback_data; // create a callback data structure
DocToTextWriter* writer = doctotext_create_html_writer(); // create a writer (html writer)
callback_data.writer = writer;
doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, &callback_data); // add callback function for new node
doctotext_writer_write_header(writer, stdout); // write the header of the output file
doctotext_parser_parse(parser); // parse the document
doctotext_writer_write_footer(writer, stdout); // write the footer of the output file
doctotext_free_parser(parser); // free parser
return 0;
}
struct DocToTextParser DocToTextParser
DllExport DocToTextWriter *DOCTOTEXT_CALL doctotext_create_html_writer()
Creates HtmlWriter. HtmlWriter writes parsed date from callbacks as html. Example of usage:
DllExport void DOCTOTEXT_CALL doctotext_writer_write_header(DocToTextWriter *writer, FILE *out_stream)
Returns beginning of text from callbacks.
DllExport void DOCTOTEXT_CALL doctotext_parser_add_callback_on_new_node(DocToTextParser *parser, void(*callback)(DocToTextInfo *, void *data), void *data)
Adds new function to execute when new node will be parsed. Node is a part of hierarchical structure....
DllExport void DOCTOTEXT_CALL doctotext_writer_write(DocToTextWriter *writer, DocToTextInfo *info, FILE *out_stream)
Converts text from callback to html format.
DllExport DocToTextParser *DOCTOTEXT_CALL doctotext_parser_manager_get_parser_by_extension(DocToTextParserManager *parser_manager, const char *format)
Returns proper parser for given format. The format is defined by file extension. Example of usage:
DllExport void DOCTOTEXT_CALL doctotext_free_parser(DocToTextParser *parser)
Frees parser. Remember not to use function free(). DocToTextParser is allocated using operator new (f...
DllExport void DOCTOTEXT_CALL doctotext_writer_write_footer(DocToTextWriter *writer, FILE *out_stream)
Returns end of text from callbacks.
struct DocToTextWriter DocToTextWriter
DllExport void DOCTOTEXT_CALL doctotext_parser_parse(DocToTextParser *parser)
Start parsing loaded data. The data comes from file or from buffer.

In the callbacks API we can add many callback functions to the parser, and they work in a similar way to transformers in the stream API. So we can add a callback function to filter by mail subject or by mail number, just as in the stream API.

Example for C++:

#include <algorithm>
#include <iostream>
#include <memory>
#include "parser.h"
#include "parser_builder.h"
#include "plain_text_writer.h"
int main(int argc, char* argv[])
{
auto parser_manager = std::make_shared<doctotext::ParserManager>(); // Create parser manager (load parsers)
std::string path = argv[1];
auto parser_builder = parser_manager->findParserByExtension(path); // get the parser builder by extension
doctotext::PlainTextWriter plain_text_writer; // create a plain text writer
plain_text_writer.write_header(std::cout); // write the header to the output stream
if (parser_builder) // if parser builder exists
{
(*parser_builder)->withParserManager(parser_manager) // set the parser manager
.build(path) // build the parser
->addOnNewNodeCallback([](doctotext::Info &info) // add a callback function to filter by subject text
{
if (info.tag_name ==
doctotext::StandardTag::TAG_MAIL) // if current node is mail
{
auto subject = info.getAttributeValue<std::string>(
"subject"); // get the subject attribute
if (subject) // if subject attribute exists
{
if (subject->find("Hello") != std::string::npos) // if subject contains "Hello"
{
info.skip = true; // skip the current node
}
}
}
})
.addOnNewNodeCallback([&plain_text_writer](
doctotext::Info &info) // add callback function to write the parsed text to the output stream
{
plain_text_writer.write_to(info, std::cout); // write the node to the output stream
})
.parse(); // start the parsing process
}
plain_text_writer.write_footer(std::cout); // write the footer to the output stream
return 0;
}
void write_footer(std::ostream &stream) const override
Write footer for plain text format.
void write_to(const doctotext::Info &info, std::ostream &stream) const override
Converts text from callback to plain text format.

Example for C:

#include "doctotext_c_api.h"
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
struct callbackData
{
DocToTextWriter *writer;
};
void onNewNodeCallback(DocToTextInfo* info, void* data) // callback function for new node
{
struct callbackData* p_callback_data = (struct callbackData*)(data); // get callback data
doctotext_writer_write(p_callback_data->writer, info, stdout); // write the parsed text to the output stream
}
void filterMailsBySubject(DocToTextInfo* info, void* data) // callback function to filter by subject text
{
const char * tag_name = doctotext_info_get_tag_name(info); // get the tag name of current node
if (strcmp(tag_name, "mail-header") == 0) // if current node is mail header
{
const char *subject = doctotext_info_get_string_attribute(info, "subject"); // get the subject attribute
if (strstr(subject, "Hello") != 0) // if subject contains "Hello"
{
doctotext_info_set_skip(info, true); // skip the current node
}
}
}
int main(int argc, char *argv[])
{
DocToTextParserManager *parser_manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
DocToTextParser* parser = doctotext_parser_manager_get_parser_by_extension(parser_manager, argv[1]); // get parser from item
struct callbackData callback_data; // create a callback data structure
DocToTextWriter* writer = doctotext_create_html_writer(); // create a writer (html writer)
callback_data.writer = writer;
doctotext_parser_add_callback_on_new_node(parser, &filterMailsBySubject, NULL); // add callback function for filter by subject text
doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, &callback_data); // add callback function to write the parsed text to the output stream
doctotext_writer_write_header(writer, stdout); // write the header of the output file
doctotext_parser_parse(parser); // parse the document
doctotext_writer_write_footer(writer, stdout); // write the footer of the output file
doctotext_free_parser(parser); // free parser
return 0;
}

SimpleExtractor

The easiest way to parse the document is to use the doctotext::SimpleExtractor. The simple extractor provides the basic functionality to parse the document.

Example for C++:

#include <iostream>
#include "simple_extractor.h"
int main(int argc, char* argv[])
{
doctotext::SimpleExtractor simple_extractor(argv[1]); // create a simple extractor
std::cout << simple_extractor.getPlainText(); // print the plain text of the document
}
The SimpleExtractor class provides basic functionality for extracting text from a document.

Example for C:

#include "doctotext_c_api.h"
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
int main(int argc, char *argv[])
{
DocToTextSimpleExtractor* extractor = doctotext_create_simple_extractor(argv[1]); // create a simple extractor
if (extractor) // if extractor exists
{
const char* text = doctotext_simple_extractor_get_plain_text(extractor); // get the plain text of the document. Calling this function starts the parsing process.
printf("%s", text); // print the plain text
}
return 0;
}
DllExport const char *DOCTOTEXT_CALL doctotext_simple_extractor_get_plain_text(DocToTextSimpleExtractor *extractor)
Gets parsed plain text from a DocToTextSimpleExtractor object.
struct DocToTextSimpleExtractor DocToTextSimpleExtractor
DllExport DocToTextSimpleExtractor *DOCTOTEXT_CALL doctotext_create_simple_extractor(const char *file_name)
Creates a new DocToTextSimpleExtractor object. Example:

SimpleExtractor also supports custom callback functions, so we can define our own transform or filter functions and use them in the parsing process.

Example for C++:

#include <iostream>
#include "parser.h"
#include "simple_extractor.h"
int main(int argc, char* argv[])
{
doctotext::SimpleExtractor simple_extractor(argv[1]); // create a simple extractor
simple_extractor.addCallbackFunction([](doctotext::Info &info)
{
if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if current node is mail
{
auto date = info.getAttributeValue<int>("date"); // get the date attribute
if (date) // if date attribute exists
{
if (*date < 1651437232) // if date is less than 01.05.2022 (1651437232 is the unix timestamp of 01.05.2022)
{
info.skip = true; // skip the current node
}
}
}
});
std::cout << simple_extractor.getPlainText(); // print the plain text of the document
}

Example for C:

#include "doctotext_c_api.h"
#include <stdbool.h>
#include "stdio.h"
#include "string.h"
void filterMailsByDate(DocToTextInfo* info, void* data) // callback function to filter by date
{
const char * tag_name = doctotext_info_get_tag_name(info); // get the tag name of current node
if (strcmp(tag_name, "mail-header") == 0) // if current node is mail header
{
unsigned int date = doctotext_info_get_uint_attribute(info, "date"); // get the date attribute
if (date < 1651437232) // if date is less than 01.05.2022 (1651437232 is the unix timestamp of 01.05.2022)
{
doctotext_info_set_skip(info, true); // skip the current node
}
}
}
int main(int argc, char *argv[])
{
DocToTextSimpleExtractor* extractor = doctotext_create_simple_extractor(argv[1]); // create a simple extractor
if (extractor) // if extractor exists
{
doctotext_simple_extractor_add_callback_function(extractor, filterMailsByDate, NULL); // set the filter function
const char* text = doctotext_simple_extractor_get_plain_text(extractor); // get the plain text of the document. Calling this function starts the parsing process.
printf("%s", text); // print the plain text
}
return 0;
}
DllExport void DOCTOTEXT_CALL doctotext_simple_extractor_add_callback_function(DocToTextSimpleExtractor *extractor, void(*callback)(DocToTextInfo *, void *data), void *data)
Adds a callback function to be called during parsing. Example of usage:
DllExport unsigned int DOCTOTEXT_CALL doctotext_info_get_uint_attribute(DocToTextInfo *info, const char *attribute_name)
Returns attribute value as a unsigned integer from DocToTextInfo.
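
The cutoff value 1651437232 used in the date filters above can be checked with the standard library alone. The sketch below is independent of DocToText; format_utc and is_before_cutoff are hypothetical helper names. It shows which UTC moment the constant stands for and how the "less than" comparison treats mails on either side of it.

```cpp
#include <cassert>
#include <ctime>
#include <string>

// Formats a unix timestamp as "YYYY-MM-DD HH:MM:SS" in UTC.
std::string format_utc(std::time_t timestamp) {
    std::tm parts{};
    if (const std::tm* utc = std::gmtime(&timestamp)) parts = *utc;  // copy out of the shared buffer
    char buffer[32];
    std::size_t n = std::strftime(buffer, sizeof buffer, "%Y-%m-%d %H:%M:%S", &parts);
    return std::string(buffer, n);
}

// True when a mail dated `mail_date` would be skipped by the cutoff filter.
bool is_before_cutoff(unsigned int mail_date, unsigned int cutoff = 1651437232u) {
    return mail_date < cutoff;
}
```

format_utc(1651437232) gives 2022-05-01 20:33:52, so the constant marks the evening of 1 May 2022 in UTC rather than midnight. A filter that should start exactly at 00:00:00 UTC on that day would use 1651363200 instead.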

Logs mechanism

DocToText parsers generate many logs with the current processing status, warnings and errors. By default all logs are sent to the standard error stream, but this can be changed, and logging can also be turned off. To redirect the log stream or turn logging on or off, use doctotext::ParserParameters (see Parser parameters), as below:

std::ofstream* my_log_stream = new std::ofstream("output_logs.txt");
doctotext::ParserParameters parameters("log_stream", my_log_stream); // set log stream parameter
parameters += doctotext::ParserParameters("verbose_logging", true); // turn on logs mechanism
Stores list of parsers parameters. Every parser can query ParserParameter for a specific parameter....
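
How a stream pointer and an on/off flag can work together is sketched below. This is a toy model, not the library's logging code: LogSettings, log_message and capture_log_demo are hypothetical names, and the default of verbose_logging is an assumption because this page does not state it. Only the std::cerr default comes from the parameter table.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical settings mirroring the two documented parameters.
struct LogSettings {
    std::ostream* log_stream = &std::cerr;  // default sink, as documented
    bool verbose_logging = false;           // assumed default, not stated in the source
};

// Writes one message to the configured stream if logging is enabled.
void log_message(const LogSettings& settings, const std::string& message) {
    if (!settings.verbose_logging || settings.log_stream == nullptr) return;
    *settings.log_stream << message << '\n';
}

// Redirects logs to an in-memory stream and returns what was captured.
std::string capture_log_demo() {
    std::ostringstream sink;
    LogSettings settings;
    settings.log_stream = &sink;         // redirect away from std::cerr
    log_message(settings, "dropped: logging is off");
    settings.verbose_logging = true;     // turn logging on
    log_message(settings, "parsing started");
    return sink.str();
}
```

Passing a pointer means the caller keeps ownership of the stream, so the stream has to outlive every parser that logs to it.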