KItinerary::ExtractorEngine
#include <extractorengine.h>
Public Types | |
---|---|
enum | Hint { NoHint = 0, ExtractFullPageRasterImages = 1, ExtractGenericIcalEvents = 2 } |
typedef QFlags< Hint > | Hints |
Public Member Functions | |
---|---|
 | ExtractorEngine (const ExtractorEngine &)=delete |
 | ExtractorEngine (ExtractorEngine &&) noexcept |
const BarcodeDecoder * | barcodeDecoder () const |
void | clear () |
const ExtractorDocumentNodeFactory * | documentNodeFactory () const |
QJsonArray | extract () |
Hints | hints () const |
void | setAdditionalExtractors (std::vector< const AbstractExtractor * > &&extractors) |
void | setContent (const QVariant &data, QStringView mimeType) |
void | setContext (const QVariant &data, QStringView mimeType) |
void | setContextDate (const QDateTime &dt) |
void | setData (const QByteArray &data, QStringView fileName={}, QStringView mimeType={}) |
void | setHints (Hints hints) |
void | setUseSeparateProcess (bool separateProcess) |
QString | usedCustomExtractor () const |
Detailed Description
Semantic data extraction engine.
This attempts to find travel itinerary data in the given input data (plain text, HTML, PDF documents, etc.) and returns the extracted JSON-LD data.
Creating Extractors
Extractor API
For adding custom extractors, two parts are needed:
- JSON meta-data describing the extractor and when to apply it, as described in the Extractor documentation.
- An extractor JavaScript file, compatible with QJSEngine.
The extractor script will have access to API defined in the JsApi namespace:
- JsApi::JsonLd: functions for generating JSON-LD data.
- JsApi::Barcode: barcode decoding functions.
- JsApi::BitArray, JsApi::ByteArray for working with binary data.
- JsApi::ExtractorEngine for recursive invocation of the extractor process.
The entry point to the script is specified in the meta-data, its argument depends on the extractor type:
- Plain text extractors are passed a string. If input is HTML or PDF, the string will be the text of the document stripped of all formatting etc.
- HTML extractors are passed a HtmlDocument instance allowing DOM-like access to the document structure.
- PDF extractors are passed a PdfDocument instance allowing access to textual and image content.
- Apple Wallet pass extractors are passed a KPkPass::BoardingPass instance.
- iCalendar event extractors are passed KCalendarCore::Event instances.
- UIC/ERA/VDV/IATA standardized ticket codes are passed as their respective types.
- Binary data is passed as ArrayBuffer.
These functions should return an object or an array of objects following the JSON-LD format defined on schema.org. JsApi::JsonLd provides helper functions to build such objects. If null or an empty array is returned, the next applicable extractor is run.
Returned objects are then passed through ExtractorPostprocessor which will normalize, augment and validate the data. This can greatly simplify the extraction, as for example the expansion of an IATA BCBP ticket token already fills most key properties of a flight reservation automatically.
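The fall-through rule described above (a null or empty result hands over to the next applicable extractor) can be illustrated with a small dependency-free sketch. It does not use the KItinerary API: `Result`, `Extractor` and `runFirstApplicable` are hypothetical names introduced only for illustration, and JSON-LD objects are modeled as plain strings.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for an extractor result: a list of JSON-LD objects,
// represented as plain strings to keep the sketch free of dependencies.
using Result = std::vector<std::string>;

// Hypothetical stand-in for an extractor entry point. An empty optional
// models a script that returned null.
using Extractor = std::function<std::optional<Result>(const std::string &input)>;

// Runs the extractors in order and returns the first non-empty result.
// A null (empty optional) or empty-array result falls through to the
// next extractor; if none produces anything, an empty result is returned.
Result runFirstApplicable(const std::vector<Extractor> &extractors, const std::string &input)
{
    for (const auto &extractor : extractors) {
        const std::optional<Result> result = extractor(input);
        if (result && !result->empty()) {
            return *result;
        }
    }
    return {};
}
```

In this model an extractor that cannot handle its input simply returns nothing, so it never has to know which extractor runs after it.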
Development Tools
For interactive testing during development of new extractors, it is recommended to link (or copy) the JSON meta data and JavaScript code files to the search path for Extractor meta data.
Additionally, there's an interactive testing and inspection tool called kitinerary-workbench (see https://invent.kde.org/pim/kitinerary-workbench).
Automated Testing
There are a few unit tests for extractors in the kitinerary repository (see autotests/extractordata). However, the majority of real-world test data cannot be shared this way due to privacy and copyright issues (e.g. PDFs containing copyrighted vendor logos and user credit card details). Therefore, there is also support for testing against external data (see extractortest.cpp).
External test data is assumed to be in a folder named kitinerary-tests next to the kitinerary source folder. The test program searches this folder recursively for folders with the following content and attempts to extract data from each test file in there.
- context.eml: MIME message header data specifying the context in which the test data was received. This typically only needs a From: and Date: line, but can even be entirely empty (or non-existing) for structured data that does not need a custom extractor. This context information is applied to all tests in this folder.
- <testname>.[txt|html|pdf|pkpass|ics|eml|mbox]: The input test data.
- <testname.extension>.json: The expected JSON-LD output. If this file doesn't exist, it is created by the test program.
- <testname.extension>.skip: If this file is present, the corresponding test is skipped.
Definition at line 107 of file engine/extractorengine.h.
Member Typedef Documentation
◆ Hints
Definition at line 167 of file engine/extractorengine.h.
Member Enumeration Documentation
◆ Hint
Hints about the document to extract based on application knowledge that can help the extractor.
Enumerator | |
---|---|
ExtractFullPageRasterImages | perform expensive image processing on (PDF) documents containing full page raster images |
ExtractGenericIcalEvents | generate Event objects for generic ical events. |
Definition at line 162 of file engine/extractorengine.h.
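Because the enumerator values are distinct bits, several hints can be combined into one Hints value. The following sketch shows the idea without Qt: the enumerator names and values are copied from the documentation above, while the plain-integer `Hints` alias and the helpers `withHint` and `hasHint` are hypothetical stand-ins for what QFlags< Hint > provides in the real class.

```cpp
#include <cstdint>

// Values copied from the Hint enumeration documented above.
enum Hint : std::uint32_t {
    NoHint = 0,
    ExtractFullPageRasterImages = 1,
    ExtractGenericIcalEvents = 2,
};

// Stand-in for QFlags<Hint>: a plain integer holding OR-ed Hint values.
using Hints = std::uint32_t;

// Returns the given set of hints with one more hint enabled.
constexpr Hints withHint(Hints hints, Hint hint)
{
    return hints | static_cast<Hints>(hint);
}

// Returns true if the given non-zero hint is part of the set.
constexpr bool hasHint(Hints hints, Hint hint)
{
    return hint != NoHint
        && (hints & static_cast<Hints>(hint)) == static_cast<Hints>(hint);
}
```

With the real class, a combined value of this kind would be passed to setHints() and can be read back with hints().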
Constructor & Destructor Documentation
◆ ExtractorEngine()
ExtractorEngine::ExtractorEngine | ( | ) |
Definition at line 100 of file engine/extractorengine.cpp.
◆ ~ExtractorEngine()
ExtractorEngine::~ExtractorEngine | ( | ) |
Definition at line 108 of file engine/extractorengine.cpp.
Member Function Documentation
◆ barcodeDecoder()
const BarcodeDecoder * ExtractorEngine::barcodeDecoder | ( | ) | const |
Barcode decoder for use by KItinerary::ExtractorDocumentProcessor.
Use this rather than your own instance as it caches repeated attempts to decode the same image.
Definition at line 177 of file engine/extractorengine.cpp.
◆ clear()
void ExtractorEngine::clear | ( | ) |
Resets the internal state. Call this before processing new input data.
Definition at line 114 of file engine/extractorengine.cpp.
◆ documentNodeFactory()
const ExtractorDocumentNodeFactory * ExtractorEngine::documentNodeFactory | ( | ) | const |
Factory for creating new document nodes.
This is only for use by KItinerary::ExtractorDocumentProcessor instances.
Definition at line 172 of file engine/extractorengine.cpp.
◆ extract()
QJsonArray ExtractorEngine::extract | ( | ) |
Perform the actual extraction, and return the JSON-LD data that has been found.
Definition at line 150 of file engine/extractorengine.cpp.
◆ hints()
ExtractorEngine::Hints ExtractorEngine::hints | ( | ) | const |
The currently set extraction hints.
Definition at line 140 of file engine/extractorengine.cpp.
◆ setAdditionalExtractors()
void ExtractorEngine::setAdditionalExtractors | ( | std::vector< const AbstractExtractor * > && | extractors | ) |
Sets additional extractors to run on the given data.
Extractors are usually selected automatically, so calling this manually is most likely not needed. This mainly exists for the external extractor process.
Definition at line 162 of file engine/extractorengine.cpp.
◆ setContent()
void ExtractorEngine::setContent | ( | const QVariant & | data, |
QStringView | mimeType ) |
Already decoded data to extract from.
Parameters
- data: Has to contain an object of a supported data type matching mimeType.
Definition at line 125 of file engine/extractorengine.cpp.
◆ setContext()
void ExtractorEngine::setContext | ( | const QVariant & | data, |
QStringView | mimeType ) |
Provide a document part that is only used to determine which extractor to use, but not for extraction itself.
This can, for example, be the MIME message part wrapping a document to extract. Using this is not necessary when this document part is already included in what is passed to setContent().
Definition at line 130 of file engine/extractorengine.cpp.
◆ setContextDate()
void ExtractorEngine::setContextDate | ( | const QDateTime & | dt | ) |
Set the date the extracted document has been issued at.
This does not need to be perfectly accurate and is used to complete incomplete date information in the document (typically a missing year). This method does not need to be called when setContext is used.
Definition at line 135 of file engine/extractorengine.cpp.
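To illustrate how a context date can complete a date with a missing year, here is a minimal sketch of one plausible strategy: choose the year so that the date is the first occurrence on or after the context date. Whether KItinerary uses exactly this rule is an assumption. `Date` and `completeYear` are hypothetical names, and edge cases such as 29 February are ignored.

```cpp
// Hypothetical minimal calendar date; the real API uses QDateTime.
struct Date {
    int year;
    int month; // 1-12
    int day;   // 1-31
};

// Completes a month/day pair with a year so that the result is the first
// such date on or after the context date (assumed strategy, see above).
Date completeYear(int month, int day, const Date &context)
{
    const bool beforeContext = month < context.month
        || (month == context.month && day < context.day);
    return Date{context.year + (beforeContext ? 1 : 0), month, day};
}
```

For a document issued on 6 December 2024, a departure printed as "3 January" would be completed to 2025 under this rule, while "24 December" would stay in 2024.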
◆ setData()
void ExtractorEngine::setData | ( | const QByteArray & | data, |
QStringView | fileName = {}, | ||
QStringView | mimeType = {} ) |
Set raw data to extract from.
Parameters
- data: Raw data to extract from.
- fileName: Used as a hint to determine the type, optional and used for MIME type auto-detection if needed.
- mimeType: MIME type of data, auto-detected if empty.
Definition at line 120 of file engine/extractorengine.cpp.
◆ setHints()
void ExtractorEngine::setHints | ( | ExtractorEngine::Hints | hints | ) |
Set extraction hints.
Definition at line 145 of file engine/extractorengine.cpp.
◆ setUseSeparateProcess()
void ExtractorEngine::setUseSeparateProcess | ( | bool | separateProcess | ) |
Perform extraction of "risky" content such as PDF files in a separate process.
This is safer as it isolates the using application from crashes/hangs due to corrupt files. It is however slower, and not available on all platforms. This is off by default.
Definition at line 157 of file engine/extractorengine.cpp.
◆ usedCustomExtractor()
QString ExtractorEngine::usedCustomExtractor | ( | ) | const |
Returns the extractor id used to obtain the result.
Can be empty if generic extractors have been used. This is not meant for normal operation and is only needed for tooling.
Definition at line 167 of file engine/extractorengine.cpp.
Documentation copyright © 1996-2024 The KDE developers.
Generated on Fri Dec 6 2024 12:03:24 by doxygen 1.12.0 written by Dimitri van Heesch, © 1997-2006