DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted data extraction software development toolkit that converts many kinds of files to plain text and HTML. Written in C++, it includes a parser that converts PST and OST files and a new API for better file processing. DocToText can be integrated with other data mining and data analytics applications. It comes with a high-grade, scriptable and trainable OCR engine whose character recognition is based on LSTM neural networks. The document parser extracts metadata and annotations and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
Abstract class for all parsers.
#include <parser.h>


Public Member Functions | |
| Parser (const std::shared_ptr< doctotext::ParserManager > &inParserManager=nullptr) | |
| virtual void | parse () const =0 |
| Executes text parsing. | |
| virtual Parser & | addOnNewNodeCallback (NewNodeCallback callback) |
| Adds a function to execute whenever a new node is created. A node is a part of the parsed text; what it represents depends on the kind of parser, for example an email from a PST file or a page from a PDF file. | |
| virtual Parser & | withParameters (const ParserParameters &parameters) |
Protected Member Functions | |
| FormattingStyle | getFormattingStyle () const |
| Loads FormattingStyle from ParserParameters. | |
| std::ostream & | getLogOutStream () const |
| bool | isVerboseLogging () const |
| Info | sendTag (const std::string &tag_name, const std::string &text="", const std::map< std::string, std::any > &attributes={}) const |
| Info | sendTag (const Info &info) const |
Protected Attributes | |
| std::shared_ptr< doctotext::ParserManager > | m_parser_manager |
| ParserParameters | m_parameters |
|
Parser (const std::shared_ptr< doctotext::ParserManager > &inParserManager=nullptr)
explicit |
| inParserManager | parser manager that holds all available parsers, which can be used recursively |
|
virtual Parser & addOnNewNodeCallback (NewNodeCallback callback)
virtual |
Adds a function to execute whenever a new node is created. A node is a part of the parsed text; what it represents depends on the kind of parser, for example an email from a PST file or a page from a PDF file.
| callback | function to execute |
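The callback mechanism described above can be illustrated with a minimal, self-contained sketch. It does not include the real `parser.h`: `SketchParser`, `PageListParser`, `collectNodes` and the two-field `Info` struct are hypothetical stand-ins, and the `NewNodeCallback` alias is assumed to be a `std::function` taking an `Info` reference. The actual DocToText types may differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: the real Info and NewNodeCallback types are declared
// in the DocToText headers and are not reproduced on this page.
struct Info {
    std::string tag_name;
    std::string plain_text;
};
using NewNodeCallback = std::function<void(Info&)>;

// Simplified model of an abstract parser with node callbacks.
class SketchParser {
public:
    virtual ~SketchParser() = default;

    // Stores the callback and returns *this so registrations can be chained.
    virtual SketchParser& addOnNewNodeCallback(NewNodeCallback callback) {
        assert(callback && "callback must be callable");
        m_callbacks.push_back(std::move(callback));
        return *this;
    }

    virtual void parse() const = 0;

protected:
    // Passes one node to every registered callback, in registration order.
    Info sendTag(const std::string& tag_name, const std::string& text = "") const {
        Info info{tag_name, text};
        for (const auto& callback : m_callbacks) {
            callback(info);
        }
        return info;
    }

private:
    std::vector<NewNodeCallback> m_callbacks;
};

// Toy parser that treats each string as one node (for example, one PDF page).
class PageListParser : public SketchParser {
public:
    explicit PageListParser(std::vector<std::string> pages)
        : m_pages(std::move(pages)) {}

    void parse() const override {
        for (const auto& page : m_pages) {
            sendTag("page", page);
        }
    }

private:
    std::vector<std::string> m_pages;
};

// Registers a callback, runs the parser, and returns the text of every node seen.
std::vector<std::string> collectNodes(const std::vector<std::string>& pages) {
    std::vector<std::string> seen;
    PageListParser parser(pages);
    parser.addOnNewNodeCallback([&seen](Info& info) { seen.push_back(info.plain_text); });
    parser.parse();
    return seen;
}
```

In this sketch the caller never iterates over the document itself; the parser pushes each node to the registered functions as it is produced, which is the pattern the description of `addOnNewNodeCallback` suggests.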
|
FormattingStyle getFormattingStyle () const
protected |
Loads FormattingStyle from ParserParameters.
|
virtual void parse () const =0
pure virtual |
Executes text parsing.
Implemented in doctotext::ParserWrapper< ParserType >.
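As a rough illustration of how a class template can implement a pure virtual `parse()`, the sketch below adapts an arbitrary worker type to an abstract interface. It is not the implementation of `doctotext::ParserWrapper`, whose internals are not shown on this page: `AbstractParser`, `UpperCaseWorker`, `WrapperSketch`, `runThroughBase` and the `run()` member are all hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cctype>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical abstract base that mirrors only the pure virtual parse() member.
class AbstractParser {
public:
    virtual ~AbstractParser() = default;
    virtual void parse() const = 0;
};

// Hypothetical concrete worker: writes an upper-cased copy of its input to a stream.
class UpperCaseWorker {
public:
    UpperCaseWorker(std::string text, std::ostream* out)
        : m_text(std::move(text)), m_out(out) {
        assert(m_out != nullptr);
    }

    void run() const {
        for (unsigned char c : m_text) {
            *m_out << static_cast<char>(std::toupper(c));
        }
    }

private:
    std::string m_text;
    std::ostream* m_out;
};

// Hypothetical adapter: implements parse() by forwarding to the wrapped type.
// It assumes ParserType provides a const member function run().
template <typename ParserType>
class WrapperSketch : public AbstractParser {
public:
    explicit WrapperSketch(ParserType impl) : m_impl(std::move(impl)) {}

    void parse() const override { m_impl.run(); }

private:
    ParserType m_impl;
};

// Calls parse() through the abstract interface and returns what the worker wrote.
std::string runThroughBase(const std::string& text) {
    std::ostringstream out;
    WrapperSketch<UpperCaseWorker> wrapper{UpperCaseWorker{text, &out}};
    const AbstractParser& parser = wrapper;
    parser.parse();
    return out.str();
}
```

Code that holds only a reference to the abstract base can then run any wrapped type without knowing its concrete class.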