DocWire DocToText - Powered by Silvercoders 5.0.5
DocToText is a data extraction software development toolkit that converts many kinds of files to plain text and HTML. It is written in C++ and includes a parser for PST and OST files, along with a new API for file processing. DocToText can be integrated with other data mining and data analytics applications. It ships with a scriptable, trainable OCR engine that uses LSTM neural networks for character recognition. The document parser extracts metadata and annotations as well as text. Supported formats include: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
example_9.cpp
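#include <cstddef>
#include <iostream>
#include <memory>
#include <optional>
#include <set>
#include <string>
#include <vector>

#include <boost/config.hpp>

#include "parser.h"
#include "parser_builder.h"
#include "parser_provider.h"
#include "simple_extractor.h"

using namespace doctotext;

// Custom parser. The declaration line is missing from the listing;
// the class name CustomParser is assumed.
class CustomParser : public Parser
{
public:
    void parse() const override
    {
        // parsing process (see doctotext::Parser)
    }

    Parser &addOnNewNodeCallback(NewNodeCallback callback) override
    {
        // manage callbacks (see doctotext::Parser)
        return *this;
    }
};

// Builder for the custom parser. The declaration line is missing from the
// listing; the class name CustomParserBuilder is assumed.
class CustomParserBuilder : public ParserBuilder
{
public:
    std::unique_ptr<Parser> build(const std::string &inFileName) const override
    {
        // build new parser from file (see doctotext::ParserBuilder)
        // Placeholder return; assumes Parser is default-constructible.
        return std::make_unique<CustomParser>();
    }

    std::unique_ptr<Parser> build(const char *buffer, size_t size) const override
    {
        // build new parser from data buffer (see doctotext::ParserBuilder)
        // Placeholder return; assumes Parser is default-constructible.
        return std::make_unique<CustomParser>();
    }
};

// Provider that the plugin exports so the library can discover the parser.
class CustomParserProvider : public ParserProvider
{
public:
    std::optional<ParserBuilder*> findParserByExtension(const std::string &extension) const override
    {
        // Stub: a real provider returns its builder when the extension matches.
        return std::nullopt;
    }

    std::optional<ParserBuilder*> findParserByData(const std::vector<char>& buffer) const override
    {
        // Stub: a real provider inspects the buffer contents here.
        return std::nullopt;
    }

    std::set<std::string> getAvailableExtensions() const override
    {
        return {".custom"};
    }
};

// Export the provider instance with C linkage so the plugin loader can find it by name.
extern "C" BOOST_SYMBOL_EXPORT CustomParserProvider custom_parser_provider;
CustomParserProvider custom_parser_provider;

int main()
{
    // The second argument is the directory that contains the plugin libraries.
    doctotext::SimpleExtractor extractor("file.custom", "path_to_directory_with_plugins");
    std::cout << extractor.getPlainText();
    return 0;
}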
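The provider in the example returns `std::nullopt` from both lookup functions, so it never actually hands out a builder. The following is a minimal sketch of how such lookups might be filled in. It does not use the real doctotext headers: `sketch::ParserBuilder`, `sketch::CustomParserBuilder`, and the `CSTM` file signature are hypothetical stand-ins chosen only for illustration, and the real `ParserProvider` interface may require more than is shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <optional>
#include <set>
#include <string>
#include <vector>

namespace sketch
{
// Hypothetical stand-in for doctotext::ParserBuilder (not the real class).
struct ParserBuilder
{
    virtual ~ParserBuilder() = default;
};

// Hypothetical builder for the ".custom" format.
struct CustomParserBuilder : ParserBuilder
{
};

class CustomParserProvider
{
public:
    // Return the builder only for the extension this provider handles.
    std::optional<ParserBuilder*> findParserByExtension(const std::string &extension) const
    {
        if (extension == ".custom")
            return m_builder.get();
        return std::nullopt;
    }

    // Return the builder when the buffer starts with the assumed "CSTM" signature.
    std::optional<ParserBuilder*> findParserByData(const std::vector<char> &buffer) const
    {
        static const std::string magic = "CSTM"; // assumption, not a real format
        if (buffer.size() >= magic.size() &&
            std::equal(magic.begin(), magic.end(), buffer.begin()))
            return m_builder.get();
        return std::nullopt;
    }

    std::set<std::string> getAvailableExtensions() const
    {
        return {".custom"};
    }

private:
    // The provider owns the builder; callers receive a non-owning pointer.
    std::unique_ptr<ParserBuilder> m_builder = std::make_unique<CustomParserBuilder>();
};
} // namespace sketch
```

With this sketch, `findParserByExtension(".custom")` yields a builder pointer while any other extension yields an empty optional. Keeping ownership of the builder inside the provider is one possible design; it keeps the returned raw pointer valid for as long as the provider object lives.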
Parser: abstract class for all parsers (defined at parser.h:130).
ParserProvider: the provider interface that CustomParserProvider overrides.
SimpleExtractor: provides basic functionality for extracting text from a document.
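The `addOnNewNodeCallback` override in the example only returns `*this`, which is what allows calls to be chained. As a rough illustration of what "manage callbacks" could mean, the sketch below stores each callback and invokes all of them for every node produced during `parse()`. The `Node` struct, the `NewNodeCallback` alias, and the hard-coded nodes are hypothetical; the real `doctotext::NewNodeCallback` type may have a different signature.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

namespace sketch
{
// Hypothetical node payload; the real library passes its own type.
struct Node
{
    std::string tag;
    std::string text;
};

// Hypothetical callback type, assumed for this sketch only.
using NewNodeCallback = std::function<void(const Node &)>;

class CallbackParser
{
public:
    // Store the callback and return *this so that calls can be chained.
    CallbackParser &addOnNewNodeCallback(NewNodeCallback callback)
    {
        m_callbacks.push_back(std::move(callback));
        return *this;
    }

    // Pretend to parse a document: emit two fixed nodes to every callback.
    void parse() const
    {
        const std::vector<Node> nodes = {{"p", "Hello"}, {"p", "World"}};
        for (const Node &node : nodes)
            for (const NewNodeCallback &callback : m_callbacks)
                callback(node);
    }

private:
    std::vector<NewNodeCallback> m_callbacks;
};
} // namespace sketch
```

A caller could register one callback that collects text and another that counts nodes, then call `parse()` once; both callbacks see every node in registration order.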