DocWire DocToText - Powered by Silvercoders 5.0.5
DocToText is a data extraction software development toolkit, written in C++, that converts many kinds of files to plain text and HTML. It includes a parser for PST and OST files and a new API for file processing, and it can be integrated with other data mining and data analytics applications. It ships with a scriptable, trainable OCR engine that uses LSTM neural networks for character recognition. The document parser extracts metadata and annotations and supports the following formats: DOC, XLS, XLSB, PPT, RTF, ODF (ODT, ODS, ODP), OOXML (DOCX, XLSX, PPTX), iWork (PAGES, NUMBERS, KEYNOTE), ODFXML (FODP, FODS, FODT), PDF, EML, HTML, Outlook (PST, OST), Image (JPG, JPEG, JFIF, BMP, PNM, PNG, TIFF, WEBP) and DICOM (DCM).
example_4.cpp


This example shows how to chain several transformers together. It uses two transformers. The first filters out mails whose subject contains the keyword "Hello". The second cancels the whole parsing process once the number of mails exceeds the limit max_mails (the code below sets max_mails = 1).

#include <iostream>
#include <memory>
#include <string>
#include "parser.h"
#include "importer.h"
#include "exporter.h"
#include "transformer.h"
#include "parsing_chain.h"

int main(int argc, char* argv[])
{
    if (argc < 2) // a file name is required as the first argument
    {
        std::cerr << "Usage: example_4 <file>" << std::endl;
        return 1;
    }
    doctotext::Importer(argv[1]) // create an importer from the file name
        | doctotext::TransformerFunc([](doctotext::Info &info) // connect the importer to the first transformer
        {
            if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if the current node is a mail
            {
                auto subject = info.getAttributeValue<std::string>("subject"); // get the subject attribute
                if (subject) // if the subject attribute exists
                {
                    if (subject->find("Hello") != std::string::npos) // if the subject contains "Hello"
                    {
                        info.skip = true; // skip the current node
                    }
                }
            }
        })
        | doctotext::TransformerFunc([counter = 0, max_mails = 1](doctotext::Info &info) mutable // connect the second transformer to the first
        {
            if (info.tag_name == doctotext::StandardTag::TAG_MAIL) // if the current node is a mail
            {
                if (++counter > max_mails) // if more than max_mails mails have been seen
                {
                    info.cancel = true; // cancel the parsing process
                }
            }
        })
        | doctotext::PlainTextExporter() // export as plain text
        | std::cout; // write the result to standard output
    return 0;
}
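The interplay of the skip and cancel flags in the chain above can be sketched without the library. The following is a minimal stand-alone model, not DocToText code: MockInfo, MockCallback, run_chain, skip_subject_containing and cancel_after are hypothetical names. It also assumes that a skipped node is not passed on to later callbacks, which the original example does not state.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for doctotext::Info with only the fields used above.
struct MockInfo {
    std::string tag_name;
    std::string subject;
    bool skip;
    bool cancel;
};

using MockCallback = std::function<void(MockInfo&)>;

// First stage: flag mails whose subject contains the keyword.
MockCallback skip_subject_containing(std::string keyword)
{
    return [keyword](MockInfo& info) {
        if (info.tag_name == "mail" && info.subject.find(keyword) != std::string::npos)
            info.skip = true;
    };
}

// Second stage: flag cancellation once more than max_mails mails were seen.
MockCallback cancel_after(int max_mails)
{
    return [counter = 0, max_mails](MockInfo& info) mutable {
        if (info.tag_name == "mail" && ++counter > max_mails)
            info.cancel = true;
    };
}

// Passes every subject through the callbacks in order.
// Assumed semantics: `skip` drops the node, `cancel` stops the whole run.
std::vector<std::string> run_chain(const std::vector<std::string>& subjects,
                                   std::vector<MockCallback> callbacks)
{
    std::vector<std::string> emitted;
    for (const auto& subject : subjects) {
        MockInfo info{"mail", subject, false, false};
        for (auto& callback : callbacks) {
            callback(info);
            if (info.skip || info.cancel)
                break; // later stages never see this node
        }
        if (info.cancel)
            break; // stop processing all remaining nodes
        if (!info.skip)
            emitted.push_back(subject);
    }
    return emitted;
}
```

With the subjects "Hello world", "Report", "Hello again", "Invoice", "Notes" and the callbacks skip_subject_containing("Hello") and cancel_after(2), this model emits "Report" and "Invoice": both "Hello" mails are skipped, and "Notes" is the third counted mail, so it triggers cancellation.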
doctotext::Importer
The Importer class. This class is used to import a file and parse it using the available parsers.
Definition: importer.h:57
doctotext::PlainTextExporter
Exporter class for plain text output.
Definition: exporter.h:137
static const std::string TAG_MAIL
Tag for mail. Attributes: "subject": std::string, "date": uint (unix timestamp).
Definition: parser.h:82
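Because the "date" attribute is documented as a unix timestamp, a small helper can turn it into a readable string. The sketch below is an illustration only: format_unix_utc is a hypothetical helper that is not part of DocToText, and it assumes the value has already been read from the mail node as an unsigned integer.

```cpp
#include <ctime>
#include <string>

// Hypothetical helper, not part of DocToText: formats a unix timestamp
// (seconds since 1970-01-01 00:00:00 UTC) as an ISO 8601 UTC string.
std::string format_unix_utc(unsigned int timestamp)
{
    const std::time_t seconds = static_cast<std::time_t>(timestamp);
    // std::gmtime returns a pointer to shared static storage, so it is not thread-safe.
    const std::tm* utc = std::gmtime(&seconds);
    if (utc == nullptr)
        return std::string();
    char buffer[32];
    const std::size_t written =
        std::strftime(buffer, sizeof(buffer), "%Y-%m-%dT%H:%M:%SZ", utc);
    return std::string(buffer, written);
}
```

For example, format_unix_utc(0) yields "1970-01-01T00:00:00Z".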
doctotext::TransformerFunc
Wraps a single function (doctotext::NewNodeCallback) into a Transformer object.
Definition: transformer.h:87
std::string tag_name
tag name
Definition: parser.h:100