DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted, data extraction software development toolkit that converts all sorts of files to plain text and html. Written in C++, this data extraction tool has a parser able to convert PST & OST files along with a brand new API for better file processing. To enhance its utility, DocToText, as a data extraction tool, can be integrated with other data mining and data analytics applications. It comes equipped with a high grade, scriptable and trainable OCR that has LSTM neural networks based character recognition. This document parser is able to extract metadata along with annotations and supports a list of formats that include: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM)
parser.h
1/***************************************************************************************************************************************************/
2/* DocToText - A multifaceted, data extraction software development toolkit that converts all sorts of files to plain text and html. */
3/* Written in C++, this data extraction tool has a parser able to convert PST & OST files along with a brand new API for better file processing. */
4/* To enhance its utility, DocToText, as a data extraction tool, can be integrated with other data mining and data analytics applications. */
5/* It comes equipped with a high grade, scriptable and trainable OCR that has LSTM neural networks based character recognition. */
6/* */
7/* This document parser is able to extract metadata along with annotations and supports a list of formats that include: */
8/* DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), */
9/* PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM) */
10/* */
11/* Copyright (c) SILVERCODERS Ltd */
12/* http://silvercoders.com */
13/* */
14/* Project homepage: */
15/* http://silvercoders.com/en/products/doctotext */
16/* https://www.docwire.io/ */
17/* */
18/* The GNU General Public License version 2 as published by the Free Software Foundation and found in the file COPYING.GPL permits */
19/* the distribution and/or modification of this application. */
20/* */
21/* Please keep in mind that any attempt to circumvent the terms of the GNU General Public License by employing wrappers, pipelines, */
22/* client/server protocols, etc. is illegal. You must purchase a commercial license if your program, which is distributed under a license */
23/* other than the GNU General Public License version 2, directly or indirectly calls any portion of this code. */
24/* Simply stop using the product if you disagree with this viewpoint. */
25/* */
26/* According to the terms of the license provided by SILVERCODERS and included in the file COPYING.COM, licensees in possession of */
27/* a current commercial license for this product may use this file. */
28/* */
29/* This program is provided WITHOUT ANY WARRANTY, not even the implicit warranty of merchantability or fitness for a particular purpose. */
30/* It is supplied in the hope that it will be useful. */
31/***************************************************************************************************************************************************/
32
33
34#ifndef PARSER_HPP
35#define PARSER_HPP
36
37#include <any>
38#include <string>
39#include <functional>
40#include <memory>
41
42#include "formatting_style.h"
43#include "parser_manager.h"
44#include "parser_parameters.h"
45#include "defines.h"
46
47namespace doctotext
48{
49
54{
55public:
56 inline static const std::string TAG_P = "p";
57 inline static const std::string TAG_CLOSE_P = "/p";
58 inline static const std::string TAG_BR = "br";
59 inline static const std::string TAG_B = "b";
60 inline static const std::string TAG_CLOSE_B = "/b";
61 inline static const std::string TAG_I = "i";
62 inline static const std::string TAG_CLOSE_I = "/i";
63 inline static const std::string TAG_U = "u";
64 inline static const std::string TAG_CLOSE_U = "/u";
65 inline static const std::string TAG_TABLE = "table";
66 inline static const std::string TAG_CLOSE_TABLE = "/table";
67 inline static const std::string TAG_TR = "tr";
68 inline static const std::string TAG_CLOSE_TR = "/tr";
69 inline static const std::string TAG_TD = "td";
70 inline static const std::string TAG_CLOSE_TD = "/td";
71 inline static const std::string TAG_TEXT = "#text";
72 inline static const std::string TAG_LINK = "a";
73 inline static const std::string TAG_CLOSE_LINK = "/a";
74 inline static const std::string TAG_STYLE = "style";
75 inline static const std::string TAG_CLOSE_STYLE = "/style";
76
77 inline static const std::string TAG_LIST = "list";
78 inline static const std::string TAG_CLOSE_LIST = "/list";
79 inline static const std::string TAG_LIST_ITEM = "list-item";
80 inline static const std::string TAG_CLOSE_LIST_ITEM = "/list-item";
81
82 inline static const std::string TAG_MAIL = "mail";
83 inline static const std::string TAG_CLOSE_MAIL = "/mail";
84 inline static const std::string TAG_MAIL_BODY = "mail-body";
85 inline static const std::string TAG_CLOSE_MAIL_BODY = "/mail-body";
86 inline static const std::string TAG_ATTACHMENT = "attachment";
87 inline static const std::string TAG_CLOSE_ATTACHMENT = "/attachment";
88 inline static const std::string TAG_FOLDER = "folder";
89 inline static const std::string TAG_CLOSE_FOLDER = "/folder";
90
91 inline static const std::string TAG_METADATA = "metadata";
92 inline static const std::string TAG_COMMENT = "comment";
93
94 inline static const std::string TAG_PAGE = "new-page";
95 inline static const std::string TAG_CLOSE_PAGE = "/new-page";
96};
97
98struct DllExport Info
99{
100 std::string tag_name;
101 std::map<std::string, std::any> attributes;
102 bool cancel = false;
103 bool skip = false;
104 std::string plain_text;
105
106 explicit Info(const std::string &tagName = "", const std::string &plainText = "", const std::map<std::string, std::any> &attrs = {})
107 : tag_name(tagName),
108 attributes(attrs),
109 plain_text(plainText)
110 {}
111
112 template<typename T>
113 std::optional<T> getAttributeValue(const std::string &name) const
114 {
115 auto attribute_value = attributes.find(name);
116 if (attribute_value!= attributes.end() && attribute_value->second.type() == typeid(T))
117 {
118 return std::any_cast<T>(attribute_value->second);
119 }
120 return std::nullopt;
121 }
122};
123
124typedef std::function<void(Info &info)> NewNodeCallback;
125
129class DllExport Parser
130{
131public:
136 explicit Parser(const std::shared_ptr<doctotext::ParserManager> &inParserManager = nullptr);
137
138 virtual ~Parser() = default;
139
143 virtual void parse() const = 0;
150 virtual Parser &addOnNewNodeCallback(NewNodeCallback callback);
151
152 virtual Parser &withParameters(const ParserParameters &parameters);
153
154protected:
160
161 std::ostream& getLogOutStream() const;
162
163 bool isVerboseLogging() const;
164
165 Info sendTag(const std::string& tag_name, const std::string& text = "", const std::map<std::string, std::any> &attributes = {}) const;
166 Info sendTag(const Info &info) const;
167
168 std::shared_ptr<doctotext::ParserManager> m_parser_manager;
169 ParserParameters m_parameters;
170
171private:
172 struct DllExport Implementation;
173 struct DllExport ImplementationDeleter { void operator() (Implementation*); };
174 std::unique_ptr<Implementation, ImplementationDeleter> base_impl;
175};
176
177} // namespace doctotext
178#endif //PARSER_HPP
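The `getAttributeValue` template in the listing above only returns a value when the stored `std::any` holds exactly the requested type. The following is a minimal sketch of that behaviour, not the library's own code: `InfoSketch` and `makeLinkNode` are hypothetical names introduced here, and the `"raw"` attribute is invented purely to show the exact-type pitfall.

```cpp
#include <any>
#include <map>
#include <optional>
#include <string>
#include <typeinfo>

// Hypothetical stand-in for doctotext::Info, reduced to the attribute map.
struct InfoSketch
{
    std::map<std::string, std::any> attributes;

    // Same logic as Info::getAttributeValue in the listing:
    // the stored type must match T exactly, otherwise std::nullopt is returned.
    template<typename T>
    std::optional<T> getAttributeValue(const std::string& name) const
    {
        auto it = attributes.find(name);
        if (it != attributes.end() && it->second.type() == typeid(T))
            return std::any_cast<T>(it->second);
        return std::nullopt;
    }
};

// Builds a link-like node. "url" is stored as std::string, while the
// illustrative "raw" attribute is stored as const char* (a decayed literal).
inline InfoSketch makeLinkNode(const std::string& url)
{
    InfoSketch info;
    info.attributes["url"] = url;
    info.attributes["raw"] = "string literal";
    return info;
}
```

A lookup such as `getAttributeValue<std::string>("raw")` yields an empty optional here, because the literal was stored as `const char*`. Wrapping literals in `std::string(...)` before storing them avoids that mismatch.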
Abstract class for all parsers.
Definition: parser.h:130
virtual void parse() const =0
Executes text parsing.
Parser(const std::shared_ptr< doctotext::ParserManager > &inParserManager=nullptr)
virtual Parser & addOnNewNodeCallback(NewNodeCallback callback)
Adds a new function to execute when a new node is created. A node is a part of parsed text....
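As a rough illustration of how chained callback registration could work, here is a self-contained sketch. `InfoSketch` and `ParserSketch` are hypothetical reductions of `Info` and `Parser`. The fixed node sequence and the rule that `cancel` stops parsing after the current node are assumptions, not behaviour confirmed by the header.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced version of Info from the listing.
struct InfoSketch
{
    std::string tag_name;
    std::string plain_text;
    bool cancel = false;
    bool skip = false;
};

using NewNodeCallback = std::function<void(InfoSketch&)>;

class ParserSketch
{
public:
    // Returns *this so registrations can be chained, as the Parser& return type suggests.
    ParserSketch& addOnNewNodeCallback(NewNodeCallback callback)
    {
        m_callbacks.push_back(std::move(callback));
        return *this;
    }

    // Emits a fixed node sequence and stops once a callback sets cancel
    // (assumed semantics). Returns the number of nodes that were emitted.
    int parse() const
    {
        const char* nodes[][2] = {{"p", ""}, {"#text", "hello"}, {"/p", ""}};
        int emitted = 0;
        for (const auto& node : nodes)
        {
            InfoSketch info{node[0], node[1]};
            for (const auto& callback : m_callbacks)
                callback(info);
            ++emitted;
            if (info.cancel)
                break;
        }
        return emitted;
    }

private:
    std::vector<NewNodeCallback> m_callbacks;
};

// Usage: concatenates the text of all "#text" nodes produced by the sketch parser.
inline std::string collectText()
{
    std::string text;
    ParserSketch parser;
    parser.addOnNewNodeCallback([&text](InfoSketch& info) {
        if (info.tag_name == "#text")
            text += info.plain_text;
    });
    parser.parse();
    return text;
}
```

Passing `Info` by non-const reference is what lets a callback report back to the parser through the `cancel` and `skip` flags.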
FormattingStyle getFormattingStyle() const
Loads FormattingStyle from ParserParameters.
Stores a list of parser parameters. Every parser can query ParserParameter for a specific parameter....
Contains the set of basic tags used in parsers.
Definition: parser.h:54
static const std::string TAG_B
Tag for bold.
Definition: parser.h:59
static const std::string TAG_ATTACHMENT
Tag for attachment. If you set skip in this tag, then the attachment won't be parsed....
Definition: parser.h:86
static const std::string TAG_FOLDER
Tag for folder. If you set skip in this tag, then the folder won't be parsed. Attributes: "name": std...
Definition: parser.h:88
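The skip behaviour described for the attachment and folder tags can be pictured with a flat tag stream. The sketch below is only an assumption about how a consumer might honour `skip`: `NodeSketch`, `walk`, and `visibleTags` are hypothetical names, nested containers of the same tag are not handled, and the real parsers decide internally what gets skipped.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical node type carrying only what the sketch needs.
struct NodeSketch
{
    std::string tag_name;
    std::string plain_text;
    bool skip = false;
};

using Callback = std::function<void(NodeSketch&)>;

// Walks a flat tag stream. When the callback sets skip on an opening tag,
// every node up to and including the matching "/tag" is dropped.
inline std::vector<std::string> walk(std::vector<NodeSketch> nodes, const Callback& callback)
{
    std::vector<std::string> seen;
    for (std::size_t i = 0; i < nodes.size(); ++i)
    {
        callback(nodes[i]);
        if (nodes[i].skip)
        {
            const std::string closing = "/" + nodes[i].tag_name;
            // Advance to the closing tag; the loop increment then steps past it.
            while (i + 1 < nodes.size() && nodes[i].tag_name != closing)
                ++i;
            continue;
        }
        seen.push_back(nodes[i].tag_name);
    }
    return seen;
}

// Usage: a mail with one attachment, where the callback skips the attachment.
inline std::vector<std::string> visibleTags()
{
    std::vector<NodeSketch> stream = {
        {"mail"}, {"mail-body"}, {"#text", "hi"}, {"/mail-body"},
        {"attachment"}, {"#text", "binary"}, {"/attachment"}, {"/mail"}};
    return walk(stream, [](NodeSketch& node) {
        if (node.tag_name == "attachment")
            node.skip = true;
    });
}
```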
static const std::string TAG_TEXT
Tag for text.
Definition: parser.h:71
static const std::string TAG_MAIL_BODY
Tag for mail body.
Definition: parser.h:84
static const std::string TAG_MAIL
Tag for mail. Attributes: "subject": std::string, "date": uint (unix timestamp).
Definition: parser.h:82
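The mail tag carries a `"subject"` string and a `"date"` Unix timestamp. A consumer might turn these into a one-line summary along the following lines. This is a sketch under assumptions: `mailSummary` is a hypothetical helper, "uint" is taken to mean `unsigned int`, and the date is formatted in UTC.

```cpp
#include <any>
#include <ctime>
#include <map>
#include <string>

// Builds "subject (YYYY-MM-DD)" from mail attributes, falling back to
// placeholders when an attribute is missing or has an unexpected type.
inline std::string mailSummary(const std::map<std::string, std::any>& attrs)
{
    std::string subject = "(no subject)";
    auto s = attrs.find("subject");
    if (s != attrs.end())
    {
        // Pointer form of any_cast returns nullptr on a type mismatch.
        if (const auto* value = std::any_cast<std::string>(&s->second))
            subject = *value;
    }

    std::string date = "unknown date";
    auto d = attrs.find("date");
    if (d != attrs.end())
    {
        if (const auto* value = std::any_cast<unsigned int>(&d->second))
        {
            const std::time_t t = static_cast<std::time_t>(*value);
            char buffer[11]; // "YYYY-MM-DD" plus terminating null
            if (const std::tm* utc = std::gmtime(&t))
            {
                if (std::strftime(buffer, sizeof buffer, "%Y-%m-%d", utc) != 0)
                    date = buffer;
            }
        }
    }
    return subject + " (" + date + ")";
}
```

Note that `std::gmtime` uses a shared static buffer, so a multi-threaded consumer would need a thread-safe alternative.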
static const std::string TAG_P
Tag for paragraph.
Definition: parser.h:56
static const std::string TAG_CLOSE_TR
Tag for closing table row.
Definition: parser.h:68
static const std::string TAG_PAGE
Tag for page. This tag is sent before parsing the page, so if we set skip in this tag, then the page won't...
Definition: parser.h:94
static const std::string TAG_CLOSE_B
Tag for closing bold.
Definition: parser.h:60
static const std::string TAG_TABLE
Tag for table.
Definition: parser.h:65
static const std::string TAG_CLOSE_ATTACHMENT
Tag for closing attachment.
Definition: parser.h:87
static const std::string TAG_CLOSE_LIST_ITEM
Tag for closing list item.
Definition: parser.h:80
static const std::string TAG_CLOSE_PAGE
Tag for closing page.
Definition: parser.h:95
static const std::string TAG_STYLE
Tag for style.
Definition: parser.h:74
static const std::string TAG_COMMENT
Tag for comments. Attributes: "author": std::string, "time": std::string (format:(yyyy-mm-ddThh:mm:ss...
Definition: parser.h:92
static const std::string TAG_TR
Tag for table row.
Definition: parser.h:67
static const std::string TAG_TD
Tag for table cell.
Definition: parser.h:69
static const std::string TAG_CLOSE_P
Tag for closing paragraph.
Definition: parser.h:57
static const std::string TAG_CLOSE_STYLE
Tag for closing style.
Definition: parser.h:75
static const std::string TAG_U
Tag for underline.
Definition: parser.h:63
static const std::string TAG_BR
Tag for line break.
Definition: parser.h:58
static const std::string TAG_CLOSE_LINK
Tag for closing link.
Definition: parser.h:73
static const std::string TAG_CLOSE_TD
Tag for closing table cell.
Definition: parser.h:70
static const std::string TAG_METADATA
Tag for metadata.
Definition: parser.h:91
static const std::string TAG_CLOSE_I
Tag for closing italic.
Definition: parser.h:62
static const std::string TAG_I
Tag for italic.
Definition: parser.h:61
static const std::string TAG_CLOSE_TABLE
Tag for closing table.
Definition: parser.h:66
static const std::string TAG_LINK
Tag for link. Attributes: "url": std::string.
Definition: parser.h:72
static const std::string TAG_CLOSE_U
Tag for closing underline.
Definition: parser.h:64
static const std::string TAG_CLOSE_MAIL_BODY
Tag for closing mail body.
Definition: parser.h:85
static const std::string TAG_CLOSE_MAIL
Tag for closing mail.
Definition: parser.h:83
static const std::string TAG_CLOSE_LIST
Tag for closing list.
Definition: parser.h:78
static const std::string TAG_CLOSE_FOLDER
Tag for closing folder.
Definition: parser.h:89
static const std::string TAG_LIST_ITEM
Tag for list item.
Definition: parser.h:79
static const std::string TAG_LIST
Tag for list. Attributes: "is_ordered": bool (default: false), "list_style_prefix": std::string.
Definition: parser.h:77
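To show how the two list attributes might be consumed, the sketch below renders list items as plain text. It is an illustration only: `attr` and `renderList` are hypothetical helpers, and the `"* "` fallback prefix is an assumption rather than a documented default.

```cpp
#include <any>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <typeinfo>
#include <vector>

using Attributes = std::map<std::string, std::any>;

// Exact-type attribute lookup, mirroring Info::getAttributeValue.
template<typename T>
std::optional<T> attr(const Attributes& attributes, const std::string& name)
{
    auto it = attributes.find(name);
    if (it != attributes.end() && it->second.type() == typeid(T))
        return std::any_cast<T>(it->second);
    return std::nullopt;
}

// Renders one line per item: "1. ", "2. ", ... for ordered lists,
// otherwise the given prefix (or the assumed "* " when none is set).
inline std::string renderList(const Attributes& listAttrs, const std::vector<std::string>& items)
{
    const bool ordered = attr<bool>(listAttrs, "is_ordered").value_or(false);
    const std::string prefix = attr<std::string>(listAttrs, "list_style_prefix").value_or("* ");
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        out += ordered ? std::to_string(i + 1) + ". " : prefix;
        out += items[i] + "\n";
    }
    return out;
}
```

Using `value_or(false)` matches the documented default for `"is_ordered"`, so a list tag without that attribute is treated as unordered.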
std::map< std::string, std::any > attributes
tag attributes
Definition: parser.h:101
std::string tag_name
tag name
Definition: parser.h:100
std::string plain_text
Stores text from last parsed node.
Definition: parser.h:104