|
DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted, data extraction software development toolkit that converts all sorts of files to plain text and html. Written in C++, this data extraction tool has a parser able to convert PST & OST files along with a brand new API for better file processing. To enhance its utility, DocToText, as a data extraction tool, can be integrated with other data mining and data analytics applications. It comes equipped with a high grade, scriptable and trainable OCR that has LSTM neural networks based character recognition. This document parser is able to extract metadata along with annotations and supports a list of formats that include: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM)
|
There is an option to adding our own parsers to doctotext. In first step we need to create a parser class inheriting from doctotext::Parser and a parser builder inheriting from doctotext::ParserBuilder. In parser there are two abstract functions to overriding: parse() and onNewNode(). According to the documentation of doctototext::Parser, the onNewNode function adds functions to call when new node will is created. Nodes are created during parsing process. The single node could be, for example, part of parsed text, email or folder. Every new node is passed to a callback as a doctotext::Info structure. In Info structure, there is a field "tag" which describes the type of node (in doctotext::StandardTag there is a description of all available tags). In the simplest case node could contain all parsed text. The parsing process starts when parse() function will be called. Instead of creating our own parser builder, there is an option to use ParserBuilderWrapper. This is a class template which provides the basic parser building support. It is a sufficient mechanism for most usage.
After that we have to create own parser provider inheriting from doctotext::ParserProvider. Important: for handle plugins mechanism we use boost library, so using boost header boost/config.hpp is additional requirement.
In the next step we need to generate library for each OS where parser will be used (e.g. shared library (so) for linux or dynamic link library (dll) for windows) and add it to the special directory (e.g. plugins). Finally, if we would like to use our new custom parser in doctotext we should pass path to directory where we keep our plugins.
In the future, there will be option to add our own importer, transformer or exporter in similar way like the parsers (using plugin mechanism). At this moment if we would like to create own importer/transformer/exporter we should add the code directly in own application.