DocWire DocToText - Powered by Silvercoders 5.0.5
A data extraction software development toolkit that converts many kinds of files to plain text and HTML. Written in C++, it includes a parser for PST and OST files and a new API for file processing. DocToText can be integrated with other data mining and data analytics applications. It ships with a scriptable, trainable OCR engine whose character recognition is based on LSTM neural networks. The parser extracts metadata and annotations, and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
example_8.cpp
#include <iostream>

#include "parser.h"
#include "simple_extractor.h"

int main(int argc, char* argv[])
{
  if (argc < 2) // the path of the input file is required
  {
    std::cerr << "Error: no input file given" << std::endl;
    return 1;
  }
  doctotext::SimpleExtractor simple_extractor(argv[1]); // create a simple extractor for the given file
  simple_extractor.addCallbackFunction([](doctotext::Info &info)
  {
    if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if current node is mail
    {
      auto date = info.getAttributeValue<int>("date"); // get the date attribute
      if (date) // if date attribute exists
      {
        if (*date < 1651437232) // if the mail is older than 01.05.2022 (1651437232 is the Unix timestamp of 2022-05-01 20:33:52 UTC)
        {
          info.skip = true; // skip the current node
        }
      }
    }
  });
  std::cout << simple_extractor.getPlainText(); // print the plain text of the document
  return 0;
}
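The decision made inside the callback above can be tried out without the library. The following is only a sketch under stated assumptions: MockInfo, kMockTagMail and skipMailBefore are hypothetical names introduced for illustration, the tag value "mail" is assumed, and none of them are part of the DocToText API. It mirrors the same three checks: the node is a mail, it carries a "date" attribute, and that date is earlier than the cutoff.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for doctotext::Info (NOT the library type).
struct MockInfo
{
  std::string tag_name;                        // tag of the current node
  std::map<std::string, long long> attributes; // attribute name -> numeric value
  bool skip = false;                           // set to true to drop the node
};

// Assumed tag value for this sketch only; the real TAG_MAIL string may differ.
const std::string kMockTagMail = "mail";

// Marks a mail node as skipped when its "date" attribute is earlier than cutoff.
// Nodes that are not mails, or mails without a date, are left untouched.
void skipMailBefore(MockInfo &info, long long cutoff)
{
  if (info.tag_name != kMockTagMail) // only mail nodes are filtered
    return;
  auto it = info.attributes.find("date");
  if (it == info.attributes.end()) // no date attribute: keep the node
    return;
  if (it->second < cutoff) // older than the cutoff: skip it
    info.skip = true;
}
```

Calling skipMailBefore(node, 1651437232) on each node reproduces the behaviour of the lambda in the example: only mails carrying a date earlier than the cutoff end up with skip set to true.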
The SimpleExtractor class provides basic functionality for extracting text from a document.
static const std::string TAG_MAIL
Tag for mail. Attributes: "subject" (std::string) and "date" (uint, Unix timestamp).
Definition: parser.h:82
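Because the "date" attribute is a Unix timestamp, a cutoff value can be derived from a calendar date instead of being hard-coded. The following is a minimal sketch, assuming UTC and the Gregorian calendar; daysFromCivil and unixTimestampUtc are hypothetical helper names and are not DocToText functions. The day count uses the well-known "days from civil" arithmetic.

```cpp
#include <cassert>

// Number of days between 1970-01-01 and the given Gregorian date (UTC).
// Hypothetical helper for illustration; not part of DocToText.
long long daysFromCivil(long long year, unsigned month, unsigned day)
{
  assert(month >= 1 && month <= 12);
  assert(day >= 1 && day <= 31);
  if (month <= 2) // treat January and February as months of the previous year
    year -= 1;
  const long long era = (year >= 0 ? year : year - 399) / 400;
  const unsigned year_of_era = static_cast<unsigned>(year - era * 400); // [0, 399]
  const unsigned shifted_month = month > 2 ? month - 3 : month + 9;     // March = 0
  const unsigned day_of_year = (153 * shifted_month + 2) / 5 + day - 1; // [0, 365]
  const unsigned day_of_era =
      year_of_era * 365 + year_of_era / 4 - year_of_era / 100 + day_of_year;
  return era * 146097 + static_cast<long long>(day_of_era) - 719468;
}

// Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) of a UTC date and time.
long long unixTimestampUtc(long long year, unsigned month, unsigned day,
                           unsigned hour, unsigned minute, unsigned second)
{
  return daysFromCivil(year, month, day) * 86400LL
         + hour * 3600LL + minute * 60LL + second;
}
```

With these helpers, unixTimestampUtc(2022, 5, 1, 0, 0, 0) yields 1651363200 (midnight UTC on 01.05.2022), and unixTimestampUtc(2022, 5, 1, 20, 33, 52) yields 1651437232, the value used in the example.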
std::string tag_name
Name of the tag for the current node.
Definition: parser.h:100