DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted data extraction software development kit that converts a wide range of file formats to plain text and HTML. Written in C++, it includes a parser for PST and OST files and a new API for better file processing. DocToText can be integrated with other data mining and data analytics applications. It ships with a high-grade, scriptable and trainable OCR engine whose character recognition is based on LSTM neural networks. The document parser can extract metadata and annotations, and it supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).

Plugins

We can add our own parsers to DocToText. As a first step, we need to create a parser class that inherits from doctotext::Parser and a parser builder that inherits from doctotext::ParserBuilder. The parser has two abstract functions to override: parse() and onNewNode() (the latter appears as addOnNewNodeCallback() in the example that follows). According to the documentation of doctotext::Parser, this function registers the functions to call whenever a new node is created. Nodes are created during the parsing process. A single node can be, for example, a piece of parsed text, an email or a folder. Every new node is passed to the callbacks as a doctotext::Info structure. The Info structure has a "tag" field that describes the type of the node (doctotext::StandardTag describes all available tags). In the simplest case, a single node contains all of the parsed text. The parsing process starts when parse() is called. Instead of creating our own parser builder, we can use ParserBuilderWrapper, a class template that provides basic parser-building support. It is sufficient for most use cases.

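#include <cstddef>
#include <memory>
#include <string>

// The unqualified names below assume the doctotext namespace is in scope.
using namespace doctotext;

// CustomParser and CustomParserBuilder are illustrative class names.
class CustomParser : public Parser
{
public:
    void parse() const override
    {
        // parsing process (see doctotext::Parser)
    }
    Parser &addOnNewNodeCallback(NewNodeCallback callback) override
    {
        // manage callbacks (see doctotext::Parser)
        return *this;
    }
};

class CustomParserBuilder : public ParserBuilder
{
public:
    std::unique_ptr<Parser> build(const std::string &inFileName) const override
    {
        // build new parser from file (see doctotext::ParserBuilder)
        return nullptr; // placeholder: construct and return the parser here
    }
    std::unique_ptr<Parser> build(const char *buffer, size_t size) const override
    {
        // build new parser from data buffer (see doctotext::ParserBuilder)
        return nullptr; // placeholder: construct and return the parser here
    }
};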
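
To make the node and callback flow described above more concrete, here is a minimal, self-contained sketch that does not use the DocToText headers. Every name in the sketch namespace (Tag, Info, LineParser, extractText) is a hypothetical stand-in chosen for illustration, not the library's API. Only the general shape is taken from the description: callbacks are registered first, nothing happens until parse() is called, and each new node reaches every callback as an Info-like structure with a tag.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical node tags; the real ones are described in doctotext::StandardTag.
enum class Tag { Text, Mail, Folder };

// Hypothetical stand-in for doctotext::Info: one node handed to each callback.
struct Info {
    Tag tag;
    std::string text;
};

using NewNodeCallback = std::function<void(const Info&)>;

// Toy parser: treats every line of its input as one Text node.
class LineParser {
public:
    explicit LineParser(std::string data) : data_(std::move(data)) {}

    // Register a callback; returning *this allows chained calls.
    LineParser& addOnNewNodeCallback(NewNodeCallback callback) {
        callbacks_.push_back(std::move(callback));
        return *this;
    }

    // Nodes are only produced once parse() is called.
    void parse() const {
        std::size_t start = 0;
        while (start < data_.size()) {
            std::size_t end = data_.find('\n', start);
            if (end == std::string::npos) end = data_.size();
            emit({Tag::Text, data_.substr(start, end - start)});
            start = end + 1;
        }
    }

private:
    // Pass one node to every registered callback, in registration order.
    void emit(const Info& info) const {
        for (const auto& callback : callbacks_) callback(info);
    }

    std::string data_;
    std::vector<NewNodeCallback> callbacks_;
};

// Collect the text of all Text nodes, separated by single spaces.
inline std::string extractText(const std::string& data) {
    std::string out;
    LineParser parser(data);
    parser.addOnNewNodeCallback([&out](const Info& info) {
        if (info.tag != Tag::Text) return;
        if (!out.empty()) out += ' ';
        out += info.text;
    });
    parser.parse();
    return out;
}

} // namespace sketch
```

With this sketch, sketch::extractText("hello\nworld") registers one callback, runs parse(), and returns "hello world". A callback that only cares about some node types filters on the tag, as the lambda does here.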
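
The idea behind a builder class template such as ParserBuilderWrapper can be sketched as follows. This document does not show the real ParserBuilderWrapper interface, so the code is an assumption about how such a template might look, not a copy of the library. Parser, ParserBuilder, BuilderWrapper and CustomParser in the sketch namespace are hypothetical stand-ins; only the two build() signatures mirror the ones shown in the example above.

```cpp
#include <cstddef>
#include <memory>
#include <string>

namespace sketch {

// Hypothetical minimal parser interface, used only by this sketch.
struct Parser {
    virtual ~Parser() = default;
    virtual std::string source() const = 0;
};

// Hypothetical builder interface with the two build() overloads.
struct ParserBuilder {
    virtual ~ParserBuilder() = default;
    virtual std::unique_ptr<Parser> build(const std::string& inFileName) const = 0;
    virtual std::unique_ptr<Parser> build(const char* buffer, std::size_t size) const = 0;
};

// Generic builder: works for any ParserType that can be constructed
// from a file name and from a (buffer, size) pair.
template <typename ParserType>
class BuilderWrapper : public ParserBuilder {
public:
    std::unique_ptr<Parser> build(const std::string& inFileName) const override {
        return std::make_unique<ParserType>(inFileName);
    }
    std::unique_ptr<Parser> build(const char* buffer, std::size_t size) const override {
        return std::make_unique<ParserType>(buffer, size);
    }
};

// Example parser that only records where its input came from.
class CustomParser : public Parser {
public:
    explicit CustomParser(const std::string& fileName)
        : source_("file:" + fileName) {}
    CustomParser(const char* /*buffer*/, std::size_t size)
        : source_("buffer:" + std::to_string(size)) {}
    std::string source() const override { return source_; }

private:
    std::string source_;
};

} // namespace sketch
```

With such a template, sketch::BuilderWrapper<sketch::CustomParser> replaces a hand-written builder class: the template forwards the arguments of build() to the matching parser constructor, so the parser only has to provide those two constructors.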

After that, we have to create our own parser provider that inherits from doctotext::ParserProvider. Important: the plugin mechanism relies on the Boost library, so including the Boost header boost/config.hpp is an additional requirement.

#include <boost/config.hpp>
#include <optional>
#include <set>
#include <string>
#include <vector>

class CustomParserProvider : public doctotext::ParserProvider
{
public:
    std::optional<ParserBuilder*> findParserByExtension(const std::string &extension) const override
    {
        return std::nullopt;
    }
    std::optional<ParserBuilder*> findParserByData(const std::vector<char>& buffer) const override
    {
        return std::nullopt;
    }
    std::set<std::string> getAvailableExtensions() const override
    {
        return {".custom"};
    }
};

// Export the provider instance under an unmangled (C linkage) symbol name.
extern "C" BOOST_SYMBOL_EXPORT CustomParserProvider custom_parser_provider;
CustomParserProvider custom_parser_provider;
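
In the provider example, both lookup functions return std::nullopt, so it never actually offers a builder. The following self-contained sketch shows one way the lookups could be filled in. It does not use the DocToText headers: ParserBuilder here is a hypothetical stand-in, the provider does not inherit from the real doctotext::ParserProvider, and the "CUST" file signature is invented purely for illustration.

```cpp
#include <algorithm>
#include <map>
#include <memory>
#include <optional>
#include <set>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in for a parser builder.
struct ParserBuilder {
    std::string name;
};

class CustomParserProvider {
public:
    CustomParserProvider() {
        // Register one builder for the ".custom" extension.
        builders_.emplace(".custom",
                          std::make_unique<ParserBuilder>(ParserBuilder{"custom"}));
    }

    // Look up a builder by file extension (including the leading dot).
    std::optional<ParserBuilder*> findParserByExtension(const std::string& extension) const {
        auto it = builders_.find(extension);
        if (it == builders_.end()) return std::nullopt;
        return it->second.get();
    }

    // Look up a builder by content: accept data that starts with the
    // invented signature "CUST".
    std::optional<ParserBuilder*> findParserByData(const std::vector<char>& buffer) const {
        static const std::string magic = "CUST";
        if (buffer.size() >= magic.size() &&
            std::equal(magic.begin(), magic.end(), buffer.begin())) {
            return builders_.at(".custom").get();
        }
        return std::nullopt;
    }

    // Report every extension for which a builder is registered.
    std::set<std::string> getAvailableExtensions() const {
        std::set<std::string> extensions;
        for (const auto& entry : builders_) extensions.insert(entry.first);
        return extensions;
    }

private:
    std::map<std::string, std::unique_ptr<ParserBuilder>> builders_;
};

} // namespace sketch
```

Keeping the builders in a map owned by the provider means the returned raw pointers stay valid for as long as the provider object lives. This fits a provider that is exported as a global object, as in the example above.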

In the next step, we need to build the plugin as a library for each operating system on which the parser will be used (e.g. a shared object (.so) on Linux or a dynamic-link library (.dll) on Windows) and place it in a dedicated directory (e.g. plugins). Finally, to use our new custom parser in DocToText, we pass the path to the directory where we keep our plugins.

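#include <iostream>
// Also include the DocToText header that declares doctotext::SimpleExtractor.

int main(int argc, char* argv[])
{
    // Second argument: path to the directory that contains the plugins.
    doctotext::SimpleExtractor extractor("file.custom", "path_to_directory_with_plugins");
    std::cout << extractor.getPlainText();
    return 0;
}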

In the future, it will be possible to add our own importers, transformers and exporters in the same way as parsers (through the plugin mechanism). For now, if we want to create our own importer, transformer or exporter, we have to add the code directly to our own application.