DocWire DocToText - Powered by Silvercoders 5.0.5
DocToText is a data extraction toolkit, written in C++, that converts many file formats to plain text and HTML. It includes a parser for PST and OST files and a new API for file processing, and it can be integrated with other data mining and data analytics applications. Its scriptable, trainable OCR uses LSTM neural networks for character recognition. The parser extracts metadata and annotations and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
The SimpleExtractor class provides basic functionality for extracting text from a document.
#include <simple_extractor.h>
Public Member Functions
| | SimpleExtractor (const std::string &file_name, const std::string &plugins_path="./plugins") |
| | SimpleExtractor (std::istream &input_stream, const std::string &plugins_path="./plugins") |
| std::string | getPlainText () const |
| | Extracts the text from the file. |
| std::string | getHtmlText () const |
| | Extracts the data from the file and converts it to HTML format. |
| void | parseAsPlainText (std::ostream &out_stream) const |
| void | parseAsHtml (std::ostream &out_stream) const |
| std::string | getMetaData () const |
| | Extracts the metadata from the file. |
| void | setFormattingStyle (const FormattingStyle &style) |
| | Sets the formatting style. |
| void | addTransformer (Transformer *transformer) |
| | Adds a callback function to the extractor. |
Definition at line 61 of file simple_extractor.h.
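The class exposes both string-returning calls (getPlainText) and stream-writing calls (parseAsPlainText). As a minimal sketch of how the stream-writing form can be used to collect output in memory, the helper below accepts any type that offers `parseAsPlainText(std::ostream&) const`. The names `collectPlainText` and `StubExtractor` are hypothetical and exist only so the sketch compiles without the library; whether getPlainText is implemented this way internally is not stated in this reference.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of DocToText): gathers whatever the extractor
// writes to the stream into a std::string.
// Extractor is any type with parseAsPlainText(std::ostream&) const.
template <typename Extractor>
std::string collectPlainText(const Extractor& extractor) {
    std::ostringstream buffer;
    extractor.parseAsPlainText(buffer);
    return buffer.str();
}

// Hypothetical stand-in for an extractor, used only to exercise the helper.
struct StubExtractor {
    std::string content;

    // Mirrors the documented signature: writes the text to the given stream.
    void parseAsPlainText(std::ostream& out_stream) const {
        out_stream << content;
    }
};

// With the library installed, the call would presumably look like:
//   doctotext::SimpleExtractor extractor("report.docx");
//   std::string text = collectPlainText(extractor);
```

Writing to a caller-supplied stream avoids holding the whole result in a string when the output can go straight to a file or socket.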
explicit doctotext::SimpleExtractor::SimpleExtractor (const std::string &file_name, const std::string &plugins_path = "./plugins")
| file_name | name of the file to parse |
doctotext::SimpleExtractor::SimpleExtractor (std::istream &input_stream, const std::string &plugins_path = "./plugins")
| input_stream | input stream to parse |
void doctotext::SimpleExtractor::addTransformer (Transformer *transformer)
Adds a callback function to the extractor.
| transformer | transformer passed as a raw pointer; ownership is transferred to the extractor |
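Because ownership of the raw pointer passes to the extractor, the caller allocates the transformer and must not delete it afterwards. The sketch below illustrates that contract with hypothetical stand-in types (`TransformerSketch`, `OwningExtractorSketch`); it assumes the receiver stores the pointer in a `std::unique_ptr`, which is one common way to implement such a transfer and not necessarily how DocToText does it.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a transformer; counts destructions for the demo.
struct TransformerSketch {
    explicit TransformerSketch(int* destroyed) : destroyed_(destroyed) {}
    ~TransformerSketch() { ++*destroyed_; }
    int* destroyed_;
};

// Hypothetical stand-in for an extractor that takes ownership of raw pointers.
class OwningExtractorSketch {
public:
    // Wraps the raw pointer immediately so it is released with the extractor.
    void addTransformer(TransformerSketch* transformer) {
        transformers_.push_back(std::unique_ptr<TransformerSketch>(transformer));
    }

    std::size_t transformerCount() const { return transformers_.size(); }

private:
    std::vector<std::unique_ptr<TransformerSketch>> transformers_;
};

// Returns how many transformers were destroyed once the extractor went out of scope.
int destroyedAfterScope() {
    int destroyed = 0;
    {
        OwningExtractorSketch extractor;
        extractor.addTransformer(new TransformerSketch(&destroyed));
        extractor.addTransformer(new TransformerSketch(&destroyed));
        // No delete here: the extractor owns both objects now.
    }
    return destroyed;
}
```

Under this contract, passing the address of a stack object or deleting the pointer after the call would lead to a double free.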
std::string doctotext::SimpleExtractor::getHtmlText () const
Extracts the data from the file and converts it to HTML format.
std::string doctotext::SimpleExtractor::getMetaData () const
Extracts the metadata from the file.
std::string doctotext::SimpleExtractor::getPlainText () const
Extracts the text from the file.
void doctotext::SimpleExtractor::setFormattingStyle (const FormattingStyle &style)
Sets the formatting style.
| style |