DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted data extraction software development toolkit that converts a wide range of file formats to plain text and HTML. Written in C++, it includes a parser for PST and OST files and a new API for file processing. DocToText can be integrated with other data mining and data analytics applications. It ships with a scriptable, trainable OCR engine that uses LSTM neural networks for character recognition. The document parser extracts metadata and annotations and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
doctotext::SimpleExtractor Class Reference

The SimpleExtractor class provides basic functionality for extracting text from a document.

#include <simple_extractor.h>

Public Member Functions

 SimpleExtractor (const std::string &file_name, const std::string &plugins_path="./plugins")
 
 SimpleExtractor (std::istream &input_stream, const std::string &plugins_path="./plugins")
 
std::string getPlainText () const
 Extracts the text from the file.

std::string getHtmlText () const
 Extracts the data from the file and converts it to the HTML format.

void parseAsPlainText (std::ostream &out_stream) const

void parseAsHtml (std::ostream &out_stream) const

std::string getMetaData () const
 Extracts the meta data from the file.

void setFormattingStyle (const FormattingStyle &style)
 Sets the formatting style.

void addTransformer (Transformer *transformer)
 Adds callback function to the extractor.
 

Detailed Description

The SimpleExtractor class provides basic functionality for extracting text from a document.

SimpleExtractor extractor("test.docx");
std::string plain_text = extractor.getPlainText(); // get the document text as plain text
std::string html = extractor.getHtmlText(); // get the document text as HTML
std::string metadata = extractor.getMetaData(); // get the document metadata as plain text
Examples
example_7.cpp, example_8.cpp, and example_9.cpp.

Definition at line 61 of file simple_extractor.h.

Constructor & Destructor Documentation

◆ SimpleExtractor() [1/2]

doctotext::SimpleExtractor::SimpleExtractor ( const std::string &  file_name,
const std::string &  plugins_path = "./plugins" 
)
explicit
Parameters
file_name    name of the file to parse

◆ SimpleExtractor() [2/2]

doctotext::SimpleExtractor::SimpleExtractor ( std::istream &  input_stream,
const std::string &  plugins_path = "./plugins" 
)
Parameters
input_stream    input stream to parse
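
When the document bytes are already in memory rather than in a file, the stream constructor can be fed from a standard string stream. The following is a minimal sketch using only the C++ standard library; the helper names makeDocumentStream and drain are hypothetical and are not part of DocToText. It shows only how to build a binary-safe std::istream and how a consumer taking std::istream& sees the bytes.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not a DocToText function): wraps raw document bytes
// in an input stream opened in binary mode.
std::istringstream makeDocumentStream(const std::string& bytes) {
    return std::istringstream(bytes, std::ios::in | std::ios::binary);
}

// Hypothetical helper: reads a stream to the end, the way any consumer
// that accepts std::istream& could.
std::string drain(std::istream& in) {
    std::ostringstream out;
    out << in.rdbuf();
    return out.str();
}
```

With the library available, the stream returned by makeDocumentStream would be passed to the constructor documented above, for example doctotext::SimpleExtractor extractor(stream);. The stream is taken by reference, so a cautious caller keeps it alive for as long as the extractor is in use.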

Member Function Documentation

◆ addTransformer()

void doctotext::SimpleExtractor::addTransformer ( Transformer *  transformer)

Adds callback function to the extractor.

extractor.addCallbackFunction(StandardFilter::filterByMailMaxCreationTime(creation_time));

Declarations shown alongside the example:

void addCallbackFunction(NewNodeCallback new_code_callback);
void addParameters(const ParserParameters &parameters);
 Stores list of parsers parameters. Every parser can query ParserParameter for a specific parameter....
static doctotext::NewNodeCallback filterByMailMaxCreationTime(unsigned int max_time)
 Filters mail by creation date. Keeps only mails that are created before the given date.

Parameters
transformer    as a raw pointer. The ownership is transferred to the extractor.
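
The ownership rule stated for the transformer parameter can be illustrated with a small model. This is a sketch under stated assumptions: Transformer, ExtractorModel, and CountingTransformer below are hypothetical stand-ins written for this example, not the DocToText classes, and the way the real extractor stores its transformers is not documented here. The sketch only shows the caller-side pattern: hand over the raw pointer and do not delete it afterwards.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the library's Transformer base class.
struct Transformer {
    virtual ~Transformer() = default;
};

// Hypothetical model of an extractor that takes ownership of raw pointers.
class ExtractorModel {
public:
    // The pointer is wrapped in a unique_ptr, so the model deletes it.
    void addTransformer(Transformer* transformer) {
        transformers_.emplace_back(transformer);
    }
    std::size_t transformerCount() const { return transformers_.size(); }

private:
    std::vector<std::unique_ptr<Transformer>> transformers_;
};

// Test transformer that tracks how many instances are alive.
struct CountingTransformer : Transformer {
    explicit CountingTransformer(int& live) : live_(live) { ++live_; }
    ~CountingTransformer() override { --live_; }
    int& live_;
};

// Returns the number of live transformers after the extractor is destroyed.
int ownershipDemo() {
    int live = 0;
    {
        ExtractorModel extractor;
        auto transformer = std::make_unique<CountingTransformer>(live);
        // release() gives up ownership; the caller must not delete the pointer.
        extractor.addTransformer(transformer.release());
    }
    return live;
}
```

In this model ownershipDemo() returns 0 because the extractor, not the caller, destroys the transformer. Deleting the pointer on the caller side as well would be a double free.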

◆ getHtmlText()

std::string doctotext::SimpleExtractor::getHtmlText ( ) const

Extracts the data from the file and converts it to the HTML format.

Returns
parsed file as HTML text

◆ getMetaData()

std::string doctotext::SimpleExtractor::getMetaData ( ) const

Extracts the meta data from the file.

Returns
parsed meta data as plain text

◆ getPlainText()

std::string doctotext::SimpleExtractor::getPlainText ( ) const

Extracts the text from the file.

Returns
parsed file as plain text
Examples
example_7.cpp and example_8.cpp.

◆ setFormattingStyle()

void doctotext::SimpleExtractor::setFormattingStyle ( const FormattingStyle &  style)

Sets the formatting style.

Parameters
style

The documentation for this class was generated from the following file: