KItinerary

engine/extractorengine.h
/*
    SPDX-FileCopyrightText: 2017-2021 Volker Krause <vkrause@kde.org>

    SPDX-License-Identifier: LGPL-2.0-or-later
*/

#pragma once

#include "kitinerary_export.h"

#include <QString>

#include <memory>
#include <vector>

class QByteArray;
class QDateTime;
class QJsonArray;
class QVariant;

namespace KItinerary {

class AbstractExtractor;
class BarcodeDecoder;
class ExtractorDocumentNode;
class ExtractorDocumentNodeFactory;
class ExtractorEnginePrivate;
class ExtractorRepository;
class ExtractorScriptEngine;

/**
 *  Semantic data extraction engine.
 *
 *  This will attempt to find travel itinerary data in the given input data
 *  (plain text, HTML text, PDF documents, etc.), and return the extracted
 *  JSON-LD data.
 *
 *  @section create_extractors Creating Extractors
 *
 *  @subsection extractor_api Extractor API
 *
 *  For adding custom extractors, two parts are needed:
 *  - JSON meta-data describing the extractor and when to apply it, as described
 *    in the Extractor documentation.
 *  - An extractor JavaScript file, compatible with QJSEngine.
 *
 *  The extractor script will have access to API defined in the JsApi namespace:
 *  - JsApi::JsonLd: functions for generating JSON-LD data.
 *  - JsApi::Barcode: barcode decoding functions.
 *  - JsApi::BitArray, JsApi::ByteArray for working with binary data.
 *  - JsApi::ExtractorEngine for recursive invocation of the extractor process.
 *
 *  The entry point to the script is specified in the meta-data, its argument depends
 *  on the extractor type:
 *  - Plain text extractors are passed a string.
 *    If input is HTML or PDF, the string will be the text of the document stripped
 *    of all formatting etc.
 *  - HTML extractors are passed a HtmlDocument instance allowing DOM-like access to
 *    the document structure.
 *  - PDF extractors are passed a PdfDocument instance allowing access to textual and
 *    image content.
 *  - Apple Wallet pass extractors are passed a KPkPass::BoardingPass instance.
 *  - iCalendar event extractors are passed KCalendarCore::Event instances.
 *  - UIC/ERA/VDV/IATA standardized ticket codes are passed as their respective types.
 *  - Binary data is passed as ArrayBuffer.
 *
 *  These functions should return an object or an array of objects following the JSON-LD
 *  format defined on schema.org. JsApi::JsonLd provides helper functions to build such
 *  objects. If @c null or an empty array is returned, the next applicable extractor is
 *  run.
 *
 *  Returned objects are then passed through ExtractorPostprocessor which will normalize,
 *  augment and validate the data. This can greatly simplify the extraction, as for example
 *  the expansion of an IATA BCBP ticket token already fills most key properties of a flight
 *  reservation automatically.
 *
 *  @subsection extractor_tools Development Tools
 *
 *  For interactive testing during development of new extractors, it is recommended to
 *  link (or copy) the JSON meta data and JavaScript code files to the search path for
 *  Extractor meta data.
 *
 *  Additionally, there's an interactive testing and inspection tool called @c kitinerary-workbench
 *  (see https://invent.kde.org/pim/kitinerary-workbench).
 *
 *  @subsection extractor_testing Automated Testing
 *
 *  There are a few unit tests for extractors in the kitinerary repository (see autotests/extractordata),
 *  however the majority of real-world test data cannot be shared this way, due to privacy
 *  and copyright issues (e.g. PDFs containing copyrighted vendor logos and user credit card details).
 *  Therefore there is also support for testing against external data (see extractortest.cpp).
 *
 *  External test data is assumed to be in a folder named @c kitinerary-tests next to the @c kitinerary
 *  source folder. The test program searches this folder recursively for folders with the following content
 *  and attempts to extract data from each test file in there.
 *
 *  - @c context.eml: MIME message header data specifying the context in which the test data
 *    was received. This typically only needs a @c From: and @c Date: line, but can even be
 *    entirely empty (or non-existing) for structured data that does not need a custom extractor.
 *    This context information is applied to all tests in this folder.
 *  - @c <testname>.[txt|html|pdf|pkpass|ics|eml|mbox]: The input test data.
 *  - @c <testname.extension>.json: The expected JSON-LD output. If this file doesn't
 *    exist, it is created by the test program.
 *  - @c <testname.extension>.skip: If this file is present the corresponding test
 *    is skipped.
 */
class KITINERARY_EXPORT ExtractorEngine
{
public:
    ExtractorEngine(ExtractorEngine &&) noexcept;
    ExtractorEngine(const ExtractorEngine &) = delete;

    /** Resets the internal state, call before processing new input data. */
    void clear();

    /** Set raw data to extract from.
     *  @param data Raw data to extract from.
     *  @param fileName Used as a hint to determine the type, optional and used for MIME type auto-detection if needed.
     *  @param mimeType MIME type of @p data, auto-detected if empty.
     */
    void setData(const QByteArray &data, QStringView fileName = {}, QStringView mimeType = {});

    /** Already decoded data to extract from.
     *  @param data Has to contain an object of a supported data type matching @p mimeType.
     */
    void setContent(const QVariant &data, QStringView mimeType);

    /** Provide a document part that is only used to determine which extractor to use,
     *  but not for extraction itself.
     *  This can for example be the MIME message part wrapping a document to extract.
     *  Using this is not necessary when this document part is already included in
     *  what is passed to setContent().
     */
    void setContext(const QVariant &data, QStringView mimeType);

    /** Set the date the extracted document has been issued at.
     *  This does not need to be perfectly accurate and is used to
     *  complete incomplete date information in the document (typically
     *  a missing year).
     *  This method does not need to be called when setContext() is used.
     */
    void setContextDate(const QDateTime &dt);

    /** Perform extraction of "risky" content such as PDF files in a separate process.
     *  This is safer as it isolates the using application from crashes/hangs due to corrupt files.
     *  It is however slower, and not available on all platforms.
     *  This is off by default.
     */
    void setUseSeparateProcess(bool separateProcess);

    /** Sets additional extractors to run on the given data.
     *  Extractors are usually automatically selected, this is therefore most likely not needed to
     *  be called manually. This mainly exists for the external extractor process.
     */
    void setAdditionalExtractors(std::vector<const AbstractExtractor*> &&extractors);

    /** Hints about the document to extract based on application knowledge that
     *  can help the extractor.
     */
    enum Hint {
        NoHint = 0,
        ExtractFullPageRasterImages = 1, ///< perform expensive image processing on (PDF) documents containing full page raster images
        ExtractGenericIcalEvents = 2, ///< generate Event objects for generic ical events.
    };
    Q_DECLARE_FLAGS(Hints, Hint)

    /** The currently set extraction hints. */
    Hints hints() const;
    /** Set extraction hints. */
    void setHints(Hints hints);

    /** Perform the actual extraction, and return the JSON-LD data
     *  that has been found.
     */
    QJsonArray extract();

    /** Returns the extractor id used to obtain the result.
     *  Can be empty if generic extractors have been used.
     *  Not supposed to be used for normal operations, this is only needed for tooling.
     */
    QString usedCustomExtractor() const;

    /** Factory for creating new document nodes.
     *  This is only for use by KItinerary::ExtractorDocumentProcessor instances.
     */
    const ExtractorDocumentNodeFactory* documentNodeFactory() const;
    /** Barcode decoder for use by KItinerary::ExtractorDocumentProcessor.
     *  Use this rather than your own instance as it caches repeated attempts to
     *  decode the same image.
     */
    const BarcodeDecoder* barcodeDecoder() const;

    ///@cond internal
    /** Extractor repository instance used by this engine. */
    const ExtractorRepository* extractorRepository() const;
    /** JavaScript execution engine for script extractors. */
    const ExtractorScriptEngine* scriptEngine() const;
    /** Document root node.
     *  Only fully populated after extraction has been performed.
     *  Only exposed for tooling.
     */
    ExtractorDocumentNode rootDocumentNode() const;
    /** Process a single node.
     *  For use by the script engine, do not use manually.
     */
    void processNode(ExtractorDocumentNode &node) const;
    ///@endcond

private:
    std::unique_ptr<ExtractorEnginePrivate> d;
};

Q_DECLARE_OPERATORS_FOR_FLAGS(ExtractorEngine::Hints)

}
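
The header declares Hints via Q_DECLARE_FLAGS, so individual Hint values are meant to be combined with bitwise OR before being passed to setHints(). The following is a minimal sketch of that combination in plain C++ without Qt. Only the enumerator names and values are taken from the header; HintMask and hasHint are hypothetical stand-ins for the QFlags-based Hints type and are not part of KItinerary.

```cpp
// Enumerator names and values copied from ExtractorEngine::Hint in the header.
enum Hint : unsigned {
    NoHint = 0,
    ExtractFullPageRasterImages = 1,
    ExtractGenericIcalEvents = 2,
};

// Hypothetical stand-in for the QFlags-based ExtractorEngine::Hints type.
using HintMask = unsigned;

// Hypothetical helper: true if the (non-zero) hint is set in the mask.
constexpr bool hasHint(HintMask mask, Hint hint)
{
    return hint != NoHint && (mask & hint) == hint;
}

// Combining both hints sets both bits; each can still be queried on its own.
constexpr HintMask bothHints = ExtractFullPageRasterImages | ExtractGenericIcalEvents;
static_assert(bothHints == 3u, "both bits are set");
static_assert(hasHint(bothHints, ExtractGenericIcalEvents), "ical hint is set");
```

Because the two values occupy distinct bits, an application can enable the expensive full page raster image processing without affecting the generic iCal event handling, and the other way around.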
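
The file naming rules for external test data in the "Automated Testing" section can be illustrated with a small classifier. This is a sketch under assumptions, not the logic of extractortest.cpp: TestFileRole, endsWith and classify are hypothetical names, and only the file names and suffixes come from the documentation above. Note that context.eml has to be checked before the input suffixes, since .eml is also a valid input type.

```cpp
#include <string_view>

// Roles a file can play in an external test folder, following the list in
// the "Automated Testing" documentation.
enum class TestFileRole { Context, Input, ExpectedOutput, SkipMarker, Other };

// Hypothetical helper: true if name ends with suffix.
constexpr bool endsWith(std::string_view name, std::string_view suffix)
{
    return name.size() >= suffix.size()
        && name.substr(name.size() - suffix.size()) == suffix;
}

// Hypothetical helper: classify a file name (without its directory part).
constexpr TestFileRole classify(std::string_view name)
{
    if (name == "context.eml") {
        return TestFileRole::Context; // checked first, ".eml" is also an input type
    }
    if (endsWith(name, ".json")) {
        return TestFileRole::ExpectedOutput; // <testname.extension>.json
    }
    if (endsWith(name, ".skip")) {
        return TestFileRole::SkipMarker; // <testname.extension>.skip
    }
    constexpr std::string_view inputSuffixes[] = {
        ".txt", ".html", ".pdf", ".pkpass", ".ics", ".eml", ".mbox",
    };
    for (const auto suffix : inputSuffixes) {
        if (endsWith(name, suffix)) {
            return TestFileRole::Input;
        }
    }
    return TestFileRole::Other;
}

static_assert(classify("ticket.pdf") == TestFileRole::Input, "input file");
static_assert(classify("ticket.pdf.json") == TestFileRole::ExpectedOutput, "expected output");
```

For an input file ticket.pdf, the expected output therefore lives in ticket.pdf.json and the skip marker is ticket.pdf.skip, both in the same folder as the input.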
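
The setContextDate() documentation says the context date is used to complete incomplete dates, typically a missing year. One plausible way to do that is sketched below: pick the first year in which the given month and day do not lie before the context date, on the assumption that a ticket refers to travel on or after its issue date. This heuristic, the SimpleDate type and completeYear are assumptions made for illustration and are not taken from the KItinerary implementation, which may behave differently.

```cpp
// Minimal calendar date for this sketch; no validation is performed.
struct SimpleDate {
    int year;
    int month; // 1-12
    int day;   // 1-31
};

// Hypothetical heuristic: complete a year-less month/day with the first year
// in which the resulting date is not before the context (issue) date.
constexpr SimpleDate completeYear(int month, int day, SimpleDate context)
{
    const bool beforeContext = month < context.month
        || (month == context.month && day < context.day);
    return SimpleDate{beforeContext ? context.year + 1 : context.year, month, day};
}

// A document issued on 2024-12-20 that mentions "3 January" refers to 2025.
static_assert(completeYear(1, 3, SimpleDate{2024, 12, 20}).year == 2025, "rolls over");
// A date later in the same month stays in the context year.
static_assert(completeYear(12, 24, SimpleDate{2024, 12, 20}).year == 2024, "same year");
```

This also shows why the context date does not need to be perfectly accurate: an error of a few days only matters for dates that fall very close to the context date itself.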
This file is part of the KDE documentation.
Documentation copyright © 1996-2024 The KDE developers.
Generated on Mon Nov 4 2024 16:28:48 by doxygen 1.12.0 written by Dimitri van Heesch, © 1997-2006
