DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted data extraction software development toolkit that converts many kinds of files to plain text and HTML. Written in C++, it includes a parser for PST and OST files and a new API for file processing. DocToText can be integrated with other data mining and data analytics applications. It ships with a scriptable, trainable OCR engine whose character recognition is based on LSTM neural networks. The document parser extracts metadata and annotations and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
simple_extractor.h
/***************************************************************************************************************************************************/
/* DocToText - A multifaceted, data extraction software development toolkit that converts all sorts of files to plain text and html. */
/* Written in C++, this data extraction tool has a parser able to convert PST & OST files along with a brand new API for better file processing. */
/* To enhance its utility, DocToText, as a data extraction tool, can be integrated with other data mining and data analytics applications. */
/* It comes equipped with a high grade, scriptable and trainable OCR that has LSTM neural networks based character recognition. */
/* */
/* This document parser is able to extract metadata along with annotations and supports a list of formats that include: */
/* DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), */
/* PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM) */
/* */
/* Copyright (c) SILVERCODERS Ltd */
/* http://silvercoders.com */
/* */
/* Project homepage: */
/* http://silvercoders.com/en/products/doctotext */
/* https://www.docwire.io/ */
/* */
/* The GNU General Public License version 2 as published by the Free Software Foundation and found in the file COPYING.GPL permits */
/* the distribution and/or modification of this application. */
/* */
/* Please keep in mind that any attempt to circumvent the terms of the GNU General Public License by employing wrappers, pipelines, */
/* client/server protocols, etc. is illegal. You must purchase a commercial license if your program, which is distributed under a license */
/* other than the GNU General Public License version 2, directly or indirectly calls any portion of this code. */
/* Simply stop using the product if you disagree with this viewpoint. */
/* */
/* According to the terms of the license provided by SILVERCODERS and included in the file COPYING.COM, licensees in possession of */
/* a current commercial license for this product may use this file. */
/* */
/* This program is provided WITHOUT ANY WARRANTY, not even the implicit warranty of merchantability or fitness for a particular purpose. */
/* It is supplied in the hope that it will be useful. */
/***************************************************************************************************************************************************/

#ifndef SIMPLE_EXTRACTOR_HPP
#define SIMPLE_EXTRACTOR_HPP

#include <algorithm>
#include <memory>
#include <string>
#include <sstream>

#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"

#include "formatting_style.h"
#include "defines.h"

namespace doctotext
{

class DllExport SimpleExtractor
{
public:
  explicit SimpleExtractor(const std::string &file_name, const std::string &plugins_path = "./plugins");

  SimpleExtractor(std::istream &input_stream, const std::string &plugins_path = "./plugins");

  std::string getPlainText() const;

  std::string getHtmlText() const;

  void parseAsPlainText(std::ostream &out_stream) const;

  void parseAsHtml(std::ostream &out_stream) const;

  std::string getMetaData() const;

  void addCallbackFunction(NewNodeCallback new_code_callback);

  void addParameters(const ParserParameters &parameters);

  void addTransformer(Transformer *transformer);

private:
  class Implementation;
  std::unique_ptr<Implementation> impl;
};


} // namespace doctotext


#endif //SIMPLE_EXTRACTOR_HPP
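The private section of the class above holds only a forward-declared Implementation and a std::unique_ptr to it, which is the general shape of the pimpl idiom. The following is a minimal single-file sketch of that idiom, not the library's actual implementation: the class Extractor, its describe() method, and the member m_file_name are hypothetical names introduced for illustration. It shows the one detail that usually matters with std::unique_ptr and an incomplete type: the destructor is declared in the class and defined only after Implementation is complete.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical class used only to illustrate the idiom; not part of DocToText.
class Extractor
{
public:
    explicit Extractor(const std::string &file_name);
    ~Extractor(); // declared here, defined after Implementation is complete

    std::string describe() const;

private:
    class Implementation;                 // forward declaration only
    std::unique_ptr<Implementation> impl; // owning pointer to the hidden state
};

// Everything below would normally live in the .cpp file.
class Extractor::Implementation
{
public:
    explicit Implementation(const std::string &file_name)
        : m_file_name(file_name) {}

    std::string m_file_name; // hypothetical private state
};

Extractor::Extractor(const std::string &file_name)
    : impl(new Implementation(file_name))
{
}

// Implementation is a complete type at this point, so unique_ptr can delete it.
Extractor::~Extractor() = default;

std::string Extractor::describe() const
{
    assert(impl != nullptr); // set in the constructor and never reset
    return "extractor for " + impl->m_file_name;
}
```

With this layout, code that includes only the class declaration never sees the private members, so they can change without forcing clients to recompile.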
ParserParameters
Stores a list of parser parameters. Every parser can query ParserParameters for a specific parameter....
The SimpleExtractor class provides basic functionality for extracting text from a document.
std::string getPlainText() const
Extracts the text from the file.
void addTransformer(Transformer *transformer)
Adds a transformer to the extractor.
void addCallbackFunction(NewNodeCallback new_code_callback)
Adds callback function to the extractor.
SimpleExtractor(const std::string &file_name, const std::string &plugins_path="./plugins")
void setFormattingStyle(const FormattingStyle &style)
Sets the formatting style.
std::string getMetaData() const
Extracts the metadata from the file.
SimpleExtractor(std::istream &input_stream, const std::string &plugins_path="./plugins")
std::string getHtmlText() const
Extracts the data from the file and converts it to the HTML format.
The Transformer transforms data from Importer or from another Transformer.
Definition: transformer.h:59
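The member descriptions above pair each string getter with a stream-writing counterpart (getPlainText with parseAsPlainText, getHtmlText with parseAsHtml) and offer a constructor that reads from a std::istream. The sketch below illustrates how such a pair can relate to each other. It does not use the real library: StubExtractor is a hypothetical stand-in that mirrors only three of the declared signatures and simply copies its input stream, and the idea that the getter is built on top of the stream method is an assumption, not something the header confirms. The helper countWords is also hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for part of the SimpleExtractor interface.
// It performs no document parsing; it only buffers the input stream.
class StubExtractor
{
public:
    explicit StubExtractor(std::istream &input_stream)
        : m_content(std::istreambuf_iterator<char>(input_stream),
                    std::istreambuf_iterator<char>())
    {
    }

    // Stream variant: writes the text to a caller-supplied stream.
    void parseAsPlainText(std::ostream &out_stream) const
    {
        out_stream << m_content;
    }

    // String variant: assumed here to wrap the stream variant.
    std::string getPlainText() const
    {
        std::ostringstream buffer;
        parseAsPlainText(buffer);
        return buffer.str();
    }

private:
    std::string m_content;
};

// Works with any type that exposes getPlainText(), so the same helper
// could be instantiated with doctotext::SimpleExtractor.
template <typename ExtractorType>
std::size_t countWords(const ExtractorType &extractor)
{
    std::istringstream words(extractor.getPlainText());
    std::string word;
    std::size_t count = 0;
    while (words >> word)
        ++count;
    return count;
}
```

With the library available, the stand-in would be replaced by including simple_extractor.h and constructing doctotext::SimpleExtractor from a file name or an input stream, as declared in the header. The stream variant is the natural choice when output goes straight to a file or socket, since it avoids holding the whole text in a std::string.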