DocWire DocToText - Powered by Silvercoders 5.0.5
A multifaceted data extraction software development toolkit that converts many kinds of files to plain text and HTML. Written in C++, it includes a parser that can convert PST and OST files, along with a new API for file processing. DocToText can be integrated with other data mining and data analytics applications. It ships with a scriptable, trainable OCR engine whose character recognition is based on LSTM neural networks. The document parser extracts metadata and annotations, and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
example_6.cpp

#include <algorithm>
#include <iostream>
#include <memory>
#include <string>

#include "parser.h"
#include "parser_builder.h"
#include "plain_text_writer.h"

int main(int argc, char* argv[])
{
  if (argc < 2) // the input file path is required
  {
    std::cerr << "Usage: " << argv[0] << " <file>" << std::endl;
    return 1;
  }
  auto parser_manager = std::make_shared<doctotext::ParserManager>(); // create the parser manager (loads parsers)
  std::string path = argv[1];
  auto parser_builder = parser_manager->findParserByExtension(path); // get the parser builder by file extension
  doctotext::PlainTextWriter plain_text_writer; // create a plain text writer
  plain_text_writer.write_header(std::cout); // write the header to the output stream
  if (parser_builder) // a parser builder exists for this extension
  {
    (*parser_builder)->withParserManager(parser_manager) // set the parser manager
      .build(path) // build the parser
      ->addOnNewNodeCallback([](doctotext::Info &info) // callback that filters mail nodes by subject text
      {
        if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // the current node is a mail
        {
          auto subject = info.getAttributeValue<std::string>("subject"); // get the subject attribute
          if (subject) // the subject attribute exists
          {
            if (subject->find("Hello") != std::string::npos) // the subject contains "Hello"
            {
              info.skip = true; // skip the current node
            }
          }
        }
      })
      .addOnNewNodeCallback([&plain_text_writer](doctotext::Info &info) // callback that writes the parsed text
      {
        plain_text_writer.write_to(info, std::cout); // write the node to the output stream
      })
      .parse(); // start the parsing process
  }
  plain_text_writer.write_footer(std::cout); // write the footer to the output stream
  return 0;
}
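The callback chain in example_6.cpp can be tried out without the library. The following is a minimal sketch under stated assumptions, not the DocToText implementation: `MockInfo`, `dispatch`, `skip_mail_with_subject` and `surviving_subjects` are hypothetical stand-ins, the tag value "mail" is a placeholder, and the rule that a node marked with `skip` is withheld from later callbacks is inferred from the example's comments rather than confirmed by the source.

```cpp
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for doctotext::Info; models only what the example uses.
struct MockInfo {
    std::string tag_name;
    std::map<std::string, std::string> attributes;
    bool skip = false;

    // Returns the attribute value, or an empty optional when it is absent.
    std::optional<std::string> getAttribute(const std::string& name) const {
        auto it = attributes.find(name);
        if (it == attributes.end()) return std::nullopt;
        return it->second;
    }
};

using Callback = std::function<void(MockInfo&)>;

// Runs callbacks in registration order and stops once one of them sets skip.
// Assumption: this mirrors how a skipped node is kept away from later callbacks.
// Returns true when every callback ran.
bool dispatch(MockInfo& info, const std::vector<Callback>& callbacks) {
    for (const auto& callback : callbacks) {
        callback(info);
        if (info.skip) return false;
    }
    return true;
}

// Builds a filter like the first callback in the example:
// mark mail nodes whose subject contains `needle` as skipped.
Callback skip_mail_with_subject(std::string needle) {
    return [needle = std::move(needle)](MockInfo& info) {
        if (info.tag_name != "mail") return;  // placeholder tag value
        auto subject = info.getAttribute("subject");
        if (subject && subject->find(needle) != std::string::npos) {
            info.skip = true;
        }
    };
}

// Sends each node through the filter and then a collector, and returns the
// subjects (empty string when absent) of the nodes that reached the collector.
std::vector<std::string> surviving_subjects(std::vector<MockInfo> nodes,
                                            const std::string& needle) {
    std::vector<std::string> seen;
    std::vector<Callback> callbacks = {
        skip_mail_with_subject(needle),
        [&seen](MockInfo& info) {
            seen.push_back(info.getAttribute("subject").value_or(""));
        },
    };
    for (auto& node : nodes) dispatch(node, callbacks);
    return seen;
}
```

In this sketch the order of registration matters: the filter has to come before the writer-like collector, otherwise the node is already written by the time it is marked as skipped.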
void write_footer(std::ostream &stream) const override
  Writes the footer for the plain text format.
void write_to(const doctotext::Info &info, std::ostream &stream) const override
  Converts text from a callback to the plain text format.
static const std::string TAG_MAIL
  Tag for mail. Attributes: "subject": std::string, "date": uint (unix timestamp).
  Definition: parser.h:82
std::string tag_name
  Tag name.
  Definition: parser.h:100
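The "date" attribute of TAG_MAIL is documented as a unix timestamp, so a callback could filter or label mail by date as well as by subject. The helpers below are a hedged sketch of that idea: `format_utc_date` and `is_older_than` are hypothetical names, `std::uint32_t` is an assumed concrete type for the documented "uint", and passing the value as `std::optional` assumes the attribute lookup can come back empty, as the subject lookup in the example does.

```cpp
#include <cstdint>
#include <ctime>
#include <optional>
#include <string>

// Formats a unix timestamp (seconds since the epoch) as "YYYY-MM-DD" in UTC.
std::string format_utc_date(std::uint32_t unix_seconds) {
    std::time_t t = static_cast<std::time_t>(unix_seconds);
    std::tm tm_utc{};
    // std::gmtime uses a shared buffer, so copy the result right away.
    // It is not thread-safe; acceptable for a single-threaded sketch.
    if (const std::tm* p = std::gmtime(&t)) tm_utc = *p;
    char buf[11];  // 10 characters plus the terminating null
    std::strftime(buf, sizeof buf, "%Y-%m-%d", &tm_utc);
    return buf;
}

// Returns true only when a date is present and lies before the cutoff.
// A mail without a date is never treated as old.
bool is_older_than(std::optional<std::uint32_t> date, std::uint32_t cutoff) {
    return date.has_value() && *date < cutoff;
}
```

Inside a callback, such a predicate would sit where the subject check sits in example_6.cpp, with `info.skip = true` set when it returns true; whether the library returns the date through the same lookup call is an assumption to verify against parser.h.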