DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted, data extraction software development toolkit that converts all sorts of files to plain text and html. Written in C++, this data extraction tool has a parser able to convert PST & OST files along with a brand new API for better file processing. To enhance its utility, DocToText, as a data extraction tool, can be integrated with other data mining and data analytics applications. It comes equipped with a high grade, scriptable and trainable OCR that has LSTM neural networks based character recognition. This document parser is able to extract metadata along with annotations and supports a list of formats that include: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM)
doctotext_c_api.h File Reference

Contains the C API for the DocToText software.

#include <stdbool.h>
#include "defines.h"

Typedefs

typedef struct DocToTextParserManager DocToTextParserManager
 
typedef struct DocToTextItem DocToTextItem
 
typedef struct DocToTextParser DocToTextParser
 
typedef struct DocToTextInfo DocToTextInfo
 
typedef struct DocToTextParameters DocToTextParameters
 
typedef struct DocToTextWriter DocToTextWriter
 
typedef struct DocToTextImporter DocToTextImporter
 
typedef struct DocToTextExporter DocToTextExporter
 
typedef struct DocToTextTransformer DocToTextTransformer
 
typedef struct DocToTextParsingChain DocToTextParsingChain
 
typedef struct DocToTextSimpleExtractor DocToTextSimpleExtractor
 

Functions

DllExport DocToTextSimpleExtractor *DOCTOTEXT_CALL doctotext_create_simple_extractor (const char *file_name)
 Creates a new DocToTextSimpleExtractor object.
 
DllExport const char *DOCTOTEXT_CALL doctotext_simple_extractor_get_plain_text (DocToTextSimpleExtractor *extractor)
 Gets parsed plain text from a DocToTextSimpleExtractor object.
 
DllExport void DOCTOTEXT_CALL doctotext_simple_extractor_add_callback_function (DocToTextSimpleExtractor *extractor, void(*callback)(DocToTextInfo *, void *data), void *data)
 Adds a callback function to be called during parsing.
 
DllExport DocToTextImporter *DOCTOTEXT_CALL doctotext_create_importer_from_file_name (DocToTextParserManager *manager, const char *file_name)
 Creates a new DocToTextImporter object. This object is used to import a file and parse it using the available parsers. The proper parser is selected based on the file extension.
 
DllExport DocToTextImporter *DOCTOTEXT_CALL doctotext_create_importer_from_stream (DocToTextParserManager *manager, FILE *input_stream)
 Creates a new DocToTextImporter object. This object is used to import data from an input stream and parse it using the available parsers.
 
DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_plain_text_exporter (FILE *output_stream)
 Creates a new DocToTextExporter object. This object is used to export parsed data to the output stream as plain text.
 
DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_html_exporter (FILE *output_stream)
 Creates a new DocToTextExporter object. This object is used to export parsed data to the output stream as HTML.
 
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_exporter (DocToTextImporter *importer, DocToTextExporter *exporter)
 Creates a connection between the importer and the exporter and returns a DocToTextParsingChain that contains all defined steps of the parsing chain.
 
DllExport DocToTextTransformer *DOCTOTEXT_CALL doctotext_create_transfomer (void(*callback)(DocToTextInfo *, void *data), void *data)
 Creates a new DocToTextTransformer object. This object is used to transform parsed data.
 
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_transformer (DocToTextImporter *importer, DocToTextTransformer *transformer)
 Creates a connection between the importer and the transformer and returns a DocToTextParsingChain that contains all defined steps of the parsing chain.
 
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_transformer (DocToTextParsingChain *parsing_chain, DocToTextTransformer *transformer)
 Adds a transformer to the parsing chain.
 
DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_exporter (DocToTextParsingChain *parsing_chain, DocToTextExporter *exporter)
 Adds an exporter to the parsing chain.
 
DllExport void DOCTOTEXT_CALL doctotext_parsing_chain_set_input (DocToTextParsingChain *parsing_chain, FILE *input_stream)
 Adds an input stream to the parsing chain. This function starts the parsing chain.
 
DllExport void DOCTOTEXT_CALL doctotext_free_importer (DocToTextImporter *importer)
 Frees importer and all resources allocated by the importer. DocToTextImporter is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_importer (which uses operator delete).
 
DllExport void DOCTOTEXT_CALL doctotext_free_exporter (DocToTextExporter *exporter)
 Frees exporter and all resources allocated by the exporter. Remember not to use function free(). DocToTextExporter is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_exporter (which uses operator delete).
 
DllExport void DOCTOTEXT_CALL doctotext_free_transformer (DocToTextTransformer *transformer)
 Frees transformer and all resources allocated by the transformer. Remember not to use function free(). DocToTextTransformer is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_transformer (which uses operator delete).
 
DllExport void DOCTOTEXT_CALL doctotext_free_parsing_chain (DocToTextParsingChain *parsing_chain)
 Frees parsing_chain and all resources allocated by the parsing chain. Remember not to use function free(). DocToTextParsingChain is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_parsing_chain (which uses operator delete).
 
DllExport DocToTextParserManager *DOCTOTEXT_CALL doctotext_init_parser_manager (const char *path_to_plugins)
 Creates a new parser manager with all available parsers.
 
DllExport char ** doctotext_parser_manager_get_available_formats (DocToTextParserManager *parser_manager, unsigned int *formats_number)
 
DllExport DocToTextParser *DOCTOTEXT_CALL doctotext_parser_manager_get_parser_by_extension (DocToTextParserManager *parser_manager, const char *format)
 Returns the proper parser for the given format. The format is defined by the file extension.
 
DllExport void DOCTOTEXT_CALL doctotext_parser_add_callback_on_new_node (DocToTextParser *parser, void(*callback)(DocToTextInfo *, void *data), void *data)
 Adds a function to execute when a new node is parsed. A node is a part of a hierarchical structure, for example a single file in a zip archive or a single email in a PST file. For a flat structure, the node is the entire file.
 
DllExport void DOCTOTEXT_CALL doctotext_parser_add_parameters (DocToTextParser *parser, DocToTextParameters *parameters)
 Adds DocToTextParameters to the parser. Every parser passes its DocToTextParameters recursively to other parsers.
 
DllExport void DOCTOTEXT_CALL doctotext_parser_parse (DocToTextParser *parser)
 Starts parsing the loaded data. The data comes from a file or from a buffer.
 
DllExport void DOCTOTEXT_CALL doctotext_free_parser (DocToTextParser *parser)
 Frees parser. Remember not to use function free(). DocToTextParser is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_parser (which uses operator delete).
 
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_plain_text (DocToTextInfo *info)
 Returns parsed text from DocToTextInfo.
 
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_tag_name (DocToTextInfo *info)
 
DllExport const char *DOCTOTEXT_CALL doctotext_info_get_string_attribute (DocToTextInfo *info, const char *attribute_name)
 Returns attribute value as a string from DocToTextInfo.
 
DllExport unsigned int DOCTOTEXT_CALL doctotext_info_get_uint_attribute (DocToTextInfo *info, const char *attribute_name)
 Returns attribute value as an unsigned integer from DocToTextInfo.
 
DllExport void DOCTOTEXT_CALL doctotext_info_set_cancel_parser (DocToTextInfo *info, bool cancel)
 Sets the cancel flag in DocToTextInfo. If cancel is true, the parsing chain is stopped.
 
DllExport void DOCTOTEXT_CALL doctotext_info_set_skip (DocToTextInfo *info, bool skip)
 Sets the skip flag in DocToTextInfo. If skip is true, the current node is skipped.
 
DllExport DocToTextParameters *DOCTOTEXT_CALL doctotext_create_parameter ()
 Creates a new, empty DocToTextParameters object. Required parameters, such as min_creation_time or max_creation_time, can then be added to it.
 
DllExport void DOCTOTEXT_CALL doctotext_add_int_parameter (DocToTextParameters *parameters, const char *name, int value)
 Adds int parameter to parser parameters.
 
DllExport void DOCTOTEXT_CALL doctotext_add_uint_parameter (DocToTextParameters *parameters, const char *name, unsigned int value)
 Adds unsigned int parameter to parser parameters.
 
DllExport void DOCTOTEXT_CALL doctotext_add_float_parameter (DocToTextParameters *parameters, const char *name, float value)
 Adds float parameter to parser parameters.
 
DllExport void DOCTOTEXT_CALL doctotext_add_string_parameter (DocToTextParameters *parameters, const char *name, const char *value)
 Adds const char* parameter to parser parameters.
 
DllExport DocToTextWriter *DOCTOTEXT_CALL doctotext_create_html_writer ()
 Creates HtmlWriter. HtmlWriter writes parsed data from callbacks as HTML.
 
DllExport DocToTextWriter *DOCTOTEXT_CALL doctotext_create_plain_text_writer ()
 Creates PlainTextWriter. PlainTextWriter writes parsed data from callbacks as plain text.
 
DllExport void DOCTOTEXT_CALL doctotext_free_writer (DocToTextWriter *writer)
 Frees the writer. DocToTextWriter is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_writer (which uses operator delete).
 
DllExport void DOCTOTEXT_CALL doctotext_writer_write (DocToTextWriter *writer, DocToTextInfo *info, FILE *out_stream)
 Converts text from a callback to the writer's output format (HTML for an HtmlWriter) and writes it to out_stream.
 
DllExport void DOCTOTEXT_CALL doctotext_writer_write_header (DocToTextWriter *writer, FILE *out_stream)
 Writes the beginning (header) of the output to out_stream.
 
DllExport void DOCTOTEXT_CALL doctotext_writer_write_footer (DocToTextWriter *writer, FILE *out_stream)
 Writes the end (footer) of the output to out_stream.
 

Detailed Description

Contains the C API for the DocToText software.

Definition in file doctotext_c_api.h.

Macro Definition Documentation

◆ DOCTOTEXT_CALL

#define DOCTOTEXT_CALL

Definition at line 54 of file doctotext_c_api.h.

Typedef Documentation

◆ DocToTextExporter

See also
doctotext::Exporter

Definition at line 66 of file doctotext_c_api.h.

◆ DocToTextImporter

See also
doctotext::Importer

Definition at line 65 of file doctotext_c_api.h.

◆ DocToTextInfo

typedef struct DocToTextInfo DocToTextInfo
See also
doctotext::Info

Definition at line 61 of file doctotext_c_api.h.

◆ DocToTextItem

typedef struct DocToTextItem DocToTextItem

Definition at line 59 of file doctotext_c_api.h.

◆ DocToTextParameters

See also
doctotext::ParserParameters

Definition at line 62 of file doctotext_c_api.h.

◆ DocToTextParser

See also
doctotext::Parser

Definition at line 60 of file doctotext_c_api.h.

◆ DocToTextParserManager

◆ DocToTextParsingChain

See also
doctotext::ParsingChain

Definition at line 68 of file doctotext_c_api.h.

◆ DocToTextSimpleExtractor

◆ DocToTextTransformer

See also
doctotext::Transformer

Definition at line 67 of file doctotext_c_api.h.

◆ DocToTextWriter

See also
doctotext::Writer

Definition at line 63 of file doctotext_c_api.h.

Function Documentation

◆ doctotext_add_float_parameter()

DllExport void DOCTOTEXT_CALL doctotext_add_float_parameter ( DocToTextParameters *  parameters,
const char *  name,
float  value 
)

Adds float parameter to parser parameters.

Parameters
parameters  pointer to parser parameters
name  name of parameter
value  value of parameter

◆ doctotext_add_int_parameter()

DllExport void DOCTOTEXT_CALL doctotext_add_int_parameter ( DocToTextParameters *  parameters,
const char *  name,
int  value 
)

Adds int parameter to parser parameters.

Parameters
parameters  pointer to parser parameters
name  name of parameter
value  value of parameter

◆ doctotext_add_string_parameter()

DllExport void DOCTOTEXT_CALL doctotext_add_string_parameter ( DocToTextParameters *  parameters,
const char *  name,
const char *  value 
)

Adds const char* parameter to parser parameters.

Parameters
parameters  pointer to parser parameters
name  name of parameter
value  value of parameter

◆ doctotext_add_uint_parameter()

DllExport void DOCTOTEXT_CALL doctotext_add_uint_parameter ( DocToTextParameters *  parameters,
const char *  name,
unsigned int  value 
)

Adds unsigned int parameter to parser parameters.

Parameters
parameters  pointer to parser parameters
name  name of parameter
value  value of parameter

◆ doctotext_connect_importer_to_exporter()

DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_exporter ( DocToTextImporter *  importer,
DocToTextExporter *  exporter 
)

Creates a connection between the importer and the exporter and returns a DocToTextParsingChain that contains all defined steps of the parsing chain.

Parameters
importer
exporter
Returns
new ParsingChain object

◆ doctotext_connect_importer_to_transformer()

DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_importer_to_transformer ( DocToTextImporter *  importer,
DocToTextTransformer *  transformer 
)

Creates a connection between the importer and the transformer and returns a DocToTextParsingChain that contains all defined steps of the parsing chain.

Parameters
importer
transformer
Returns
new ParsingChain object

◆ doctotext_connect_parsing_chain_to_exporter()

DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_exporter ( DocToTextParsingChain *  parsing_chain,
DocToTextExporter *  exporter 
)

Adds an exporter to the parsing chain.

Parameters
parsing_chain  ParsingChain object
exporter  DocToTextExporter object
Returns
parsing_chain with added exporter

◆ doctotext_connect_parsing_chain_to_transformer()

DllExport DocToTextParsingChain *DOCTOTEXT_CALL doctotext_connect_parsing_chain_to_transformer ( DocToTextParsingChain *  parsing_chain,
DocToTextTransformer *  transformer 
)

Adds a transformer to the parsing chain.

Parameters
parsing_chain  ParsingChain object
transformer  DocToTextTransformer object
Returns
parsing_chain with added transformer

◆ doctotext_create_html_exporter()

DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_html_exporter ( FILE *  output_stream)

Creates a new DocToTextExporter object. This object is used to export parsed data to the output stream as HTML.

Parameters
output_stream  stream that receives the exported HTML
Returns
new DocToTextExporter object

◆ doctotext_create_html_writer()

DllExport DocToTextWriter *DOCTOTEXT_CALL doctotext_create_html_writer ( )

Creates HtmlWriter. HtmlWriter writes parsed data from callbacks as HTML. Example of usage:

void onNewNodeCallback(DocToTextInfo* info, void* data)
{
    DocToTextWriter* writer = (DocToTextWriter*)(data);
    doctotext_writer_write(writer, info, stdout); // write the HTML for the current node to stdout
}
DocToTextWriter* writer = doctotext_create_html_writer();
doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, writer);
doctotext_writer_write_header(writer, stdout); // write the HTML header
doctotext_parser_parse(parser); // parse document
doctotext_writer_write_footer(writer, stdout); // write the HTML footer
doctotext_free_writer(writer); // free writer
Returns
new HtmlWriter

◆ doctotext_create_importer_from_file_name()

DllExport DocToTextImporter *DOCTOTEXT_CALL doctotext_create_importer_from_file_name ( DocToTextParserManager *  manager,
const char *  file_name 
)

Creates a new DocToTextImporter object. This object is used to import a file and parse it using the available parsers. The proper parser is selected based on the file extension.

Parameters
manager  parser manager
file_name  path to the file to be imported
Returns

◆ doctotext_create_importer_from_stream()

DllExport DocToTextImporter *DOCTOTEXT_CALL doctotext_create_importer_from_stream ( DocToTextParserManager *  manager,
FILE *  input_stream 
)

Creates a new DocToTextImporter object. This object is used to import data from an input stream and parse it using the available parsers.

Parameters
manager  parser manager
input_stream  stream with input data to parse
Returns

◆ doctotext_create_parameter()

DllExport DocToTextParameters *DOCTOTEXT_CALL doctotext_create_parameter ( )

Creates a new, empty DocToTextParameters object. Required parameters, such as min_creation_time or max_creation_time, can then be added to it. Example of usage:

DocToTextParameters* parameters = doctotext_create_parameter(); // create empty DocToTextParameters
doctotext_add_uint_parameter(parameters, "min_creation_time", 1234123); // add min_creation_time parameter
doctotext_add_uint_parameter(parameters, "max_creation_time", 1834123); // add max_creation_time parameter
doctotext_parser_add_parameters(parser, parameters); // pass all parameters to the parser
Returns
new DocToTextParameters

◆ doctotext_create_plain_text_exporter()

DllExport DocToTextExporter *DOCTOTEXT_CALL doctotext_create_plain_text_exporter ( FILE *  output_stream)

Creates a new DocToTextExporter object. This object is used to export parsed data to the output stream as plain text.

Parameters
output_stream  stream that receives the exported plain text
Returns
new DocToTextExporter object

◆ doctotext_create_simple_extractor()

DllExport DocToTextSimpleExtractor *DOCTOTEXT_CALL doctotext_create_simple_extractor ( const char *  file_name)

Creates a new DocToTextSimpleExtractor object. Example:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "doctotext_c_api.h"
void filterMailsByDate(DocToTextInfo* info, void* data) // callback function to filter by date
{
    const char* tag_name = doctotext_info_get_tag_name(info); // get the tag name of the current node
    if (strcmp(tag_name, "mail-header") == 0) // if the current node is a mail header
    {
        unsigned int date = doctotext_info_get_uint_attribute(info, "date"); // get the date attribute
        if (date < 1651437232) // if the date is earlier than 01.05.2022 (1651437232 is a Unix timestamp on 01.05.2022)
        {
            doctotext_info_set_skip(info, true); // skip the current node
        }
    }
}
int main(int argc, char* argv[])
{
    if (argc < 2) // a file name is required
    {
        return 1;
    }
    DocToTextSimpleExtractor* extractor = doctotext_create_simple_extractor(argv[1]); // create a simple extractor
    if (extractor) // if the extractor exists
    {
        doctotext_simple_extractor_add_callback_function(extractor, filterMailsByDate, NULL); // set the filter function
        const char* text = doctotext_simple_extractor_get_plain_text(extractor); // get the plain text of the document; calling this function starts the parsing process
        printf("%s", text); // print the plain text
    }
    return 0;
}
Parameters
file_name  The name of the file to be parsed.
Returns
The new DocToTextSimpleExtractor object.
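
The example above hard-codes 1651437232 as the cutoff. The following is a minimal sketch of how such a cutoff could be derived from a calendar date instead, assuming the "date" attribute holds seconds since the Unix epoch in UTC (the example suggests this, but this reference does not state it). The helpers days_from_civil, unix_timestamp_utc and mail_is_older_than are hypothetical names introduced for illustration; they are not part of the DocToText API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not part of DocToText): number of days between
   1970-01-01 and the given Gregorian calendar date. */
static long long days_from_civil(int year, int month, int day)
{
    assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
    if (month <= 2)
        year -= 1;                               /* count Jan/Feb as the end of the previous year */
    const long long era = (year >= 0 ? year : year - 399) / 400;
    const long long yoe = year - era * 400;      /* year within the 400-year era: 0..399 */
    const long long mp = (month > 2) ? month - 3 : month + 9; /* March is month 0 */
    const long long doy = (153 * mp + 2) / 5 + day - 1;       /* day within the shifted year */
    const long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;          /* 719468 days separate 0000-03-01 and 1970-01-01 */
}

/* Hypothetical helper: Unix timestamp of midnight UTC on the given date. */
static long long unix_timestamp_utc(int year, int month, int day)
{
    return days_from_civil(year, month, day) * 86400LL;
}

/* Hypothetical helper: true if a mail's "date" value lies before the cutoff. */
static bool mail_is_older_than(unsigned int date, long long cutoff)
{
    return (long long)date < cutoff;
}
```

Inside a callback such as filterMailsByDate, the result of mail_is_older_than(doctotext_info_get_uint_attribute(info, "date"), cutoff) could be passed to doctotext_info_set_skip, with the cutoff computed once by unix_timestamp_utc(2022, 5, 1) and handed to the callback through its data pointer.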

◆ doctotext_create_transfomer()

DllExport DocToTextTransformer *DOCTOTEXT_CALL doctotext_create_transfomer ( void(*)(DocToTextInfo *, void *data)  callback,
void *  data 
)

Creates a new DocToTextTransformer object. This object is used to transform parsed data. Example of usage:

#include <stdio.h>
#include <string.h>
#include "doctotext_c_api.h"
void filterMailsBySubject(DocToTextInfo* info, void* data) // callback function to filter by subject text
{
    const char* tag_name = doctotext_info_get_tag_name(info); // get the tag name of the current node
    if (strcmp(tag_name, "mail-header") == 0) // if the current node is a mail header
    {
        const char* subject = doctotext_info_get_string_attribute(info, "subject"); // get the subject attribute
        if (strstr(subject, "Hello") != NULL) // if the subject contains "Hello"
        {
            doctotext_info_set_skip(info, true); // skip the current node
        }
    }
}
int main(int argc, char* argv[])
{
    DocToTextParserManager* manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
    const char* file_name = argv[1];
    DocToTextImporter* importer = doctotext_create_importer_from_file_name(manager, file_name); // create an importer from the file name
    DocToTextExporter* exporter = doctotext_create_plain_text_exporter(stdout); // create a plain text exporter and set the output stream
    DocToTextTransformer* transformer = doctotext_create_transfomer(filterMailsBySubject, NULL); // create a transformer and set the callback function
    DocToTextParsingChain* chain = doctotext_connect_importer_to_transformer(importer, transformer); // create a parsing chain by connecting the importer and the transformer
    chain = doctotext_connect_parsing_chain_to_exporter(chain, exporter); // connect the parsing chain to the exporter (this step starts the parsing)
    doctotext_free_importer(importer); // free importer
    doctotext_free_transformer(transformer); // free transformer
    doctotext_free_exporter(exporter); // free exporter
    doctotext_free_parsing_chain(chain); // free parsing chain
    return 0;
}
Parameters
callback  function to be called during transformation
data  data to be passed to the callback function
Returns
new DocToTextTransformer object
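
The subject filter in the example passes the attribute value straight to strstr. This reference does not say what doctotext_info_get_string_attribute returns when the attribute is absent, so the sketch below assumes it could be NULL and guards against that. subject_contains is a hypothetical helper name, not part of the DocToText API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of DocToText): NULL-safe substring test
   for use inside a transformer callback. */
static bool subject_contains(const char *subject, const char *needle)
{
    assert(needle != NULL);   /* the search text is supplied by the caller */
    if (subject == NULL)      /* assumption: a missing attribute may be reported as NULL */
        return false;
    return strstr(subject, needle) != NULL;
}
```

In filterMailsBySubject, the condition could then read subject_contains(subject, "Hello"), which keeps the callback from dereferencing a missing subject.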

◆ doctotext_free_exporter()

DllExport void DOCTOTEXT_CALL doctotext_free_exporter ( DocToTextExporter *  exporter)

Frees exporter and all resources allocated by the exporter. Remember not to use function free(). DocToTextExporter is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_exporter (which uses operator delete).

Parameters
exporter

◆ doctotext_free_importer()

DllExport void DOCTOTEXT_CALL doctotext_free_importer ( DocToTextImporter *  importer)

Frees importer and all resources allocated by the importer. DocToTextImporter is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_importer (which uses operator delete).

Parameters
importer

◆ doctotext_free_parser()

DllExport void DOCTOTEXT_CALL doctotext_free_parser ( DocToTextParser *  parser)

Frees parser. Remember not to use function free(). DocToTextParser is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_parser (which uses operator delete).

Parameters
parser

◆ doctotext_free_parsing_chain()

DllExport void DOCTOTEXT_CALL doctotext_free_parsing_chain ( DocToTextParsingChain *  parsing_chain)

Frees parsing_chain and all resources allocated by the parsing chain. Remember not to use function free(). DocToTextParsingChain is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_parsing_chain (which uses operator delete).

Parameters
parsing_chain

◆ doctotext_free_transformer()

DllExport void DOCTOTEXT_CALL doctotext_free_transformer ( DocToTextTransformer *  transformer)

Frees transformer and all resources allocated by the transformer. Remember not to use function free(). DocToTextTransformer is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_transformer (which uses operator delete).

Parameters
transformer

◆ doctotext_free_writer()

DllExport void DOCTOTEXT_CALL doctotext_free_writer ( DocToTextWriter *  writer)

Frees the writer. DocToTextWriter is allocated using operator new (from C++) and is supposed to be deleted by doctotext_free_writer (which uses operator delete).

Parameters
writer  writer to release

◆ doctotext_info_get_plain_text()

DllExport const char *DOCTOTEXT_CALL doctotext_info_get_plain_text ( DocToTextInfo *  info)

Returns parsed text from DocToTextInfo.

Parameters
info  callback parameter. It contains information about parsed text.
Returns
parsed text

◆ doctotext_info_get_string_attribute()

DllExport const char *DOCTOTEXT_CALL doctotext_info_get_string_attribute ( DocToTextInfo *  info,
const char *  attribute_name 
)

Returns attribute value as a string from DocToTextInfo.

Parameters
info  callback parameter. It contains information about parsed text.
attribute_name  name of attribute
Returns
attribute value

◆ doctotext_info_get_tag_name()

DllExport const char *DOCTOTEXT_CALL doctotext_info_get_tag_name ( DocToTextInfo *  info)
Parameters
info
Returns

◆ doctotext_info_get_uint_attribute()

DllExport unsigned int DOCTOTEXT_CALL doctotext_info_get_uint_attribute ( DocToTextInfo *  info,
const char *  attribute_name 
)

Returns attribute value as an unsigned integer from DocToTextInfo.

Parameters
info  callback parameter. It contains information about parsed text.
attribute_name  name of attribute
Returns
attribute value

◆ doctotext_info_set_cancel_parser()

DllExport void DOCTOTEXT_CALL doctotext_info_set_cancel_parser ( DocToTextInfo *  info,
bool  cancel 
)

Sets the cancel flag in DocToTextInfo. If cancel is true, the parsing chain is stopped. Example of usage:

bool stop_parser;
void onNewNodeCallback(DocToTextInfo* info, void* data)
{
    const char* text = doctotext_info_get_plain_text(info);
    bool* p_stop_parser = (bool*)(data); // the flag passed when the callback was registered
    doctotext_info_set_cancel_parser(info, *p_stop_parser);
}
doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, &stop_parser);
Parameters
info  input/output structure from callback
cancel
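
The example above cancels based on an external flag. As a further illustration of passing state through the callback's data pointer, here is a minimal sketch that stops after a fixed number of nodes. struct NodeBudget and node_budget_exhausted are hypothetical names introduced for this sketch; only the idea of feeding a bool into doctotext_info_set_cancel_parser comes from this reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state object (not part of DocToText): how many nodes have
   been seen and how many are allowed before cancelling. */
struct NodeBudget
{
    unsigned int seen;
    unsigned int limit;
};

/* Counts one more node and reports whether the budget is now exceeded.
   The first `limit` calls return false; every later call returns true. */
static bool node_budget_exhausted(struct NodeBudget *budget)
{
    assert(budget != NULL);
    budget->seen += 1;
    return budget->seen > budget->limit;
}
```

A callback registered with a pointer to a struct NodeBudget as its data argument could cast data back to struct NodeBudget * and pass node_budget_exhausted(budget) as the cancel argument of doctotext_info_set_cancel_parser.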

◆ doctotext_info_set_skip()

DllExport void DOCTOTEXT_CALL doctotext_info_set_skip ( DocToTextInfo *  info,
bool  skip 
)

Sets skip flag in DocToTextInfo. If skip is true then current node will be skipped. Example of usage:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "doctotext_c_api.h"
struct callbackData
{
    DocToTextWriter* writer;
};
void onNewNodeCallback(DocToTextInfo* info, void* data) // callback function for a new node
{
    struct callbackData* p_callback_data = (struct callbackData*)(data); // get callback data
    doctotext_writer_write(p_callback_data->writer, info, stdout); // write the parsed text to the output stream
}
void filterMailsBySubject(DocToTextInfo* info, void* data) // callback function to filter by subject text
{
    const char* tag_name = doctotext_info_get_tag_name(info); // get the tag name of the current node
    if (strcmp(tag_name, "mail-header") == 0) // if the current node is a mail header
    {
        const char* subject = doctotext_info_get_string_attribute(info, "subject"); // get the subject attribute
        if (strstr(subject, "Hello") != NULL) // if the subject contains "Hello"
        {
            doctotext_info_set_skip(info, true); // skip the current node
        }
    }
}
int main(int argc, char* argv[])
{
    DocToTextParserManager* parser_manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
    DocToTextParser* parser = doctotext_parser_manager_get_parser_by_extension(parser_manager, argv[1]); // get the parser for the given format
    struct callbackData callback_data; // create a callback data structure
    DocToTextWriter* writer = doctotext_create_html_writer(); // create a writer (HTML writer)
    callback_data.writer = writer;
    doctotext_parser_add_callback_on_new_node(parser, &filterMailsBySubject, NULL); // add the callback that filters by subject text
    doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, &callback_data); // add the callback that writes the parsed text to the output stream
    doctotext_writer_write_header(writer, stdout); // write the header of the output file
    doctotext_parser_parse(parser); // parse the document
    doctotext_writer_write_footer(writer, stdout); // write the footer of the output file
    doctotext_free_writer(writer); // free writer
    doctotext_free_parser(parser); // free parser
    return 0;
}
Parameters
info    input/output structure from callback
skip    true if node should be skipped
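
The skip decision in a filter callback usually reduces to a small predicate on an attribute value. This reference does not state whether doctotext_info_get_string_attribute can return NULL when an attribute is missing, and passing NULL to strstr is undefined behavior, so a NULL-tolerant helper is a cautious choice. The sketch below assumes that possibility; `subject_contains` is a hypothetical helper, not part of the DocToText API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when `subject` is non-NULL and contains `needle`. */
bool subject_contains(const char *subject, const char *needle)
{
    assert(needle != NULL);
    if (subject == NULL)
    {
        return false; /* a missing attribute never matches */
    }
    return strstr(subject, needle) != NULL;
}

/* Inside a filter callback this could be used as:
 *   doctotext_info_set_skip(info, subject_contains(subject, "Hello"));
 */
```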

◆ doctotext_init_parser_manager()

DllExport DocToTextParserManager *DOCTOTEXT_CALL doctotext_init_parser_manager ( const char *  path_to_plugins)

Creates new parser manager with all available parsers.

Parameters
path_to_plugins    Path to plugins directory.
Returns
Handle to the new parser manager

◆ doctotext_parser_add_callback_on_new_node()

DllExport void DOCTOTEXT_CALL doctotext_parser_add_callback_on_new_node ( DocToTextParser *  parser,
void(*)(DocToTextInfo *, void *data)  callback,
void *  data 
)

Adds a function to execute whenever a new node is parsed. A node is a part of a hierarchical structure: for example, a single file in a ZIP archive or a single email in a PST file. For a flat (non-hierarchical) document, the node is the entire file. Example of usage:

bool print_in_terminal;
void onNewNodeCallback(DocToTextInfo* info, void* data)
{
    const char* text = doctotext_info_get_plain_text(info);
    bool* print_in_terminal = (bool*)(data);
    if ((*print_in_terminal) == true)
    {
        printf("%s", text); // never pass parsed text as the format string
    }
}
doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, &print_in_terminal);
Parameters
parser    parser to which the callback is added
callback    function called for each new node
data    pointer passed to the callback function as its data argument
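
Because data is an untyped pointer, the callback has to cast it back to the type the caller registered. The sketch below shows that pattern in isolation with a small accumulator; `struct text_stats` and `text_stats_record` are hypothetical names chosen for this illustration, and the object behind data is assumed to outlive the whole parse.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-parse statistics owned by the caller. */
struct text_stats
{
    size_t nodes; /* number of nodes seen */
    size_t bytes; /* total length of their plain text */
};

/* Recovers the typed state from the untyped pointer and updates it. */
void text_stats_record(void *data, const char *text)
{
    struct text_stats *stats = (struct text_stats *)data;
    assert(stats != NULL);
    stats->nodes++;
    if (text != NULL)
    {
        stats->bytes += strlen(text);
    }
}

/* Inside a new-node callback this could be called as:
 *   text_stats_record(data, doctotext_info_get_plain_text(info));
 */
```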

◆ doctotext_parser_add_parameters()

DllExport void DOCTOTEXT_CALL doctotext_parser_add_parameters ( DocToTextParser *  parser,
DocToTextParameters *  parameters 
)

Adds DocToTextParameters to the parser. Every parser recursively passes DocToTextParameters on to other parsers.

Parameters
parser
parameters

◆ doctotext_parser_manager_get_available_formats()

DllExport char ** doctotext_parser_manager_get_available_formats ( DocToTextParserManager *  parser_manager,
unsigned int *  formats_number 
)
Parameters
parser_manager
formats_number    receives the number of supported formats
Returns
names of all supported formats
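
A typical use of the returned array is to check whether a given format is available before asking for a parser. The sketch below assumes the array holds formats_number NUL-terminated strings; this reference does not say whether the names include a leading dot or who owns the returned memory, so the helper only reads the array. `format_is_listed` is a hypothetical name, not part of the DocToText API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when `format` exactly matches one of the listed names. */
bool format_is_listed(char **formats, unsigned int formats_number, const char *format)
{
    assert(format != NULL);
    if (formats == NULL)
    {
        return false;
    }
    for (unsigned int i = 0; i < formats_number; i++)
    {
        if (formats[i] != NULL && strcmp(formats[i], format) == 0)
        {
            return true;
        }
    }
    return false;
}
```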

◆ doctotext_parser_manager_get_parser_by_extension()

DllExport DocToTextParser *DOCTOTEXT_CALL doctotext_parser_manager_get_parser_by_extension ( DocToTextParserManager *  parser_manager,
const char *  format 
)

Returns proper parser for given format. The format is defined by file extension. Example of usage:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "doctotext_c_api.h"
struct callbackData
{
    DocToTextWriter *writer;
};
void onNewNodeCallback(DocToTextInfo* info, void* data) // callback function for new node
{
    struct callbackData* p_callback_data = (struct callbackData*)(data); // get callback data
    doctotext_writer_write(p_callback_data->writer, info, stdout); // write the parsed text to the output stream
}
int main(int argc, char *argv[])
{
    if (argc < 2) // argv[1] is required
    {
        return 1;
    }
    DocToTextParserManager *parser_manager = doctotext_init_parser_manager("plugins/"); // create a parser manager (load parsers)
    DocToTextParser* parser = doctotext_parser_manager_get_parser_by_extension(parser_manager, argv[1]); // get parser for the given format
    struct callbackData callback_data; // create a callback data structure
    DocToTextWriter* writer = doctotext_create_html_writer(); // create a writer (html writer)
    callback_data.writer = writer;
    doctotext_parser_add_callback_on_new_node(parser, &onNewNodeCallback, &callback_data); // add callback function for new node
    doctotext_writer_write_header(writer, stdout); // write the header of the output file
    doctotext_parser_parse(parser); // parse the document
    doctotext_writer_write_footer(writer, stdout); // write the footer of the output file
    doctotext_free_parser(parser); // free parser
    return 0;
}
Parameters
parser_manager
format
Returns
parser for given format
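
Since the format is identified by a file extension, callers that start from a file name need to isolate the extension first. The following sketch shows one way to do that in plain C. It assumes '/' as the only path separator and returns the extension without the leading dot; whether the function above expects the dot is not stated in this reference. `file_extension` is a hypothetical helper, not part of the DocToText API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns a pointer to the extension inside `path` (without the dot),
 * or an empty string when the file name has no extension. */
const char *file_extension(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/');
    const char *name = (slash != NULL) ? slash + 1 : path; /* last path component */
    const char *dot = strrchr(path, '.');
    /* No dot, a dot that belongs to a directory or starts the name, or a trailing dot. */
    if (dot == NULL || dot <= name || dot[1] == '\0')
    {
        return "";
    }
    return dot + 1;
}
```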

◆ doctotext_parser_parse()

DllExport void DOCTOTEXT_CALL doctotext_parser_parse ( DocToTextParser *  parser)

Starts parsing the loaded data. The data comes from a file or from a buffer.

Parameters
parser

◆ doctotext_parsing_chain_set_input()

DllExport void DOCTOTEXT_CALL doctotext_parsing_chain_set_input ( DocToTextParsingChain *  parsing_chain,
FILE *  input_stream 
)

Adds an input stream to the parsing chain. Calling this function starts the parsing chain.

Parameters
parsing_chain    ParsingChain object
input_stream    input stream

◆ doctotext_simple_extractor_add_callback_function()

DllExport void DOCTOTEXT_CALL doctotext_simple_extractor_add_callback_function ( DocToTextSimpleExtractor *  extractor,
void(*)(DocToTextInfo *, void *data)  callback,
void *  data 
)

Adds a callback function to be called during parsing. Example of usage:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include "doctotext_c_api.h"
void filterMailsByDate(DocToTextInfo* info, void* data) // callback function to filter by date
{
    const char * tag_name = doctotext_info_get_tag_name(info); // get the tag name of current node
    if (strcmp(tag_name, "mail-header") == 0) // if current node is mail header
    {
        unsigned int date = doctotext_info_get_uint_attribute(info, "date"); // get the date attribute
        if (date < 1651437232) // if date is earlier than the cutoff (1651437232 is a Unix timestamp on 01.05.2022)
        {
            doctotext_info_set_skip(info, true); // skip the current node
        }
    }
}
int main(int argc, char *argv[])
{
    if (argc < 2) // argv[1] (the file name) is required
    {
        return 1;
    }
    DocToTextSimpleExtractor* extractor = doctotext_create_simple_extractor(argv[1]); // create a simple extractor
    if (extractor) // if extractor exists
    {
        doctotext_simple_extractor_add_callback_function(extractor, filterMailsByDate, NULL); // set the filter function
        const char* text = doctotext_simple_extractor_get_plain_text(extractor); // get the plain text of the document; calling this function starts the parsing process
        printf("%s", text); // print the plain text
    }
    return 0;
}
Parameters
extractor    The DocToTextSimpleExtractor object.
callback    The callback function.
data    The data to be passed to the callback function.
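
The date filter in the example compares against a hard-coded Unix timestamp. That constant, 1651437232, falls on 1 May 2022 at 20:33:52 UTC rather than at midnight. Where an exact day boundary matters, the cutoff can be computed from a calendar date instead. The sketch below uses the well-known days-from-civil arithmetic for the proleptic Gregorian calendar and assumes the date attribute is expressed in seconds since the Unix epoch in UTC; `utc_midnight_timestamp` is a hypothetical helper, not part of the DocToText API.

```c
#include <assert.h>

/* Returns the Unix timestamp of 00:00:00 UTC on the given calendar date. */
long long utc_midnight_timestamp(int year, int month, int day)
{
    assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
    /* Treat March as the first month so the leap day falls at the end of the year. */
    long long y = (month <= 2) ? (long long)year - 1 : year;
    long long era = (y >= 0 ? y : y - 399) / 400;          /* 400-year cycle */
    long long yoe = y - era * 400;                         /* year of era: 0..399 */
    long long mp = (month > 2) ? month - 3 : month + 9;    /* March = 0 ... February = 11 */
    long long doy = (153 * mp + 2) / 5 + day - 1;          /* day of (shifted) year */
    long long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; /* day of era */
    long long days = era * 146097 + doe - 719468;          /* days since 1970-01-01 */
    return days * 86400;
}
```

With this helper, the comparison in a filter callback could read `date < utc_midnight_timestamp(2022, 5, 1)`, which makes the intended cutoff visible in the code.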

◆ doctotext_simple_extractor_get_plain_text()

DllExport const char *DOCTOTEXT_CALL doctotext_simple_extractor_get_plain_text ( DocToTextSimpleExtractor *  extractor)

Gets parsed plain text from a DocToTextSimpleExtractor object.

Parameters
extractor    The DocToTextSimpleExtractor object.
Returns
The parsed plain text.

◆ doctotext_writer_write()

DllExport void DOCTOTEXT_CALL doctotext_writer_write ( DocToTextWriter *  writer,
DocToTextInfo *  info,
FILE *  out_stream 
)

Converts text from the callback to HTML format and writes it to the output stream.

Parameters
writer    HtmlWriter
info    input/output structure from callback
out_stream    output stream

◆ doctotext_writer_write_footer()

DllExport void DOCTOTEXT_CALL doctotext_writer_write_footer ( DocToTextWriter *  writer,
FILE *  out_stream 
)

Writes the end of the text from callbacks (the footer) to the output stream.

Parameters
writer    HtmlWriter

◆ doctotext_writer_write_header()

DllExport void DOCTOTEXT_CALL doctotext_writer_write_header ( DocToTextWriter *  writer,
FILE *  out_stream 
)

Writes the beginning of the text from callbacks (the header) to the output stream.

Parameters
writer    HtmlWriter