|ExtractorEngine (ExtractorEngine &&) noexcept|
|ExtractorEngine (const ExtractorEngine &)=delete|
|void||setCalendar (const QSharedPointer< KCalCore::Calendar > &calendar)|
|void||setContent (KMime::Content *content)|
|void||setContext (KMime::Content *context)|
|void||setContextDate (const QDateTime &dt)|
|void||setExtractors (std::vector< const Extractor * > &&extractors)|
|void||setHtmlDocument (HtmlDocument *htmlDoc)|
|void||setPass (KPkPass::Pass *pass)|
|void||setPdfDocument (PdfDocument *pdfDoc)|
|void||setText (const QString &text)|
Unstructured data extraction engine.
Applies the given Extractor instances to the input data (plain text, HTML, PDF documents, etc.) and returns the extracted JSON-LD data.
For adding custom extractors, two parts are needed:
- JSON meta-data describing the extractor and when to apply it, as described in the Extractor documentation.
- An extractor script implementing the entry point specified in the meta-data.
The extractor script will have access to API defined in the JsApi namespace:
- JsApi::Context: information about the input data being processed.
- JsApi::JsonLd: functions for generating JSON-LD data.
- JsApi::Barcode: barcode decoding functions.
The entry point to the script is specified in the meta-data; its argument depends on the extractor type:
- Plain text extractors are passed a string. If the input is HTML or PDF, the string is the text of the document stripped of all formatting.
- HTML extractors are passed an HtmlDocument instance allowing DOM-like access to the document structure.
- PDF extractors are passed a PdfDocument instance allowing access to textual and image content.
- Apple Wallet pass extractors are passed a KPkPass::BoardingPass instance.
- iCalendar event extractors are passed KCalCore::Event instances.
These functions should return an object or an array of objects following the JSON-LD format defined on schema.org. JsApi::JsonLd provides helper functions to build such objects. If null or an empty array is returned, the next applicable extractor is run.
Returned objects are then passed through ExtractorPostprocessor which will normalize, augment and validate the data. This can greatly simplify the extraction, as for example the expansion of an IATA BCBP ticket token already fills most key properties of a flight reservation automatically.
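The fall-through to the next applicable extractor described above can be modelled with a small, self-contained sketch. It does not use the KItinerary API: Result, ExtractorFn and runFirstMatching are hypothetical names introduced only for illustration, and JSON-LD objects are represented as plain strings.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in: a result is a list of JSON-LD objects, kept as strings here.
using Result = std::vector<std::string>;
// Hypothetical model of one extractor script entry point.
using ExtractorFn = std::function<Result(const std::string &)>;

// Runs the extractors in order and returns the first non-empty result.
// An empty Result models a script returning null or an empty array.
Result runFirstMatching(const std::vector<ExtractorFn> &extractors, const std::string &input)
{
    for (const auto &extractor : extractors) {
        assert(extractor); // calling an empty std::function would throw
        Result result = extractor(input);
        if (!result.empty()) {
            return result; // later extractors are not run
        }
    }
    return {}; // no extractor produced any data
}
```

In this model the order of the extractor list decides which script gets the first chance; whether the real engine orders extractors the same way is not stated here.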
Additionally, there's an interactive testing and inspection tool called kitinerary-workbench (see https://phabricator.kde.org/source/kitinerary-workbench/).
There are a few unit tests for extractors in the kitinerary repository (see autotests/extractordata). However, the majority of real-world test data cannot be shared this way due to privacy and copyright issues (e.g. PDFs containing copyrighted vendor logos and user credit card details). Therefore there is also support for testing against external data (see extractortest.cpp).
External test data is assumed to be in a folder named kitinerary-tests next to the kitinerary source folder. The test program searches this folder recursively for folders with the following content and attempts to extract data from each test file in there.
- context.eml: MIME message header data specifying the context in which the test data was received. This typically only needs a Date: line, but can be entirely empty (or missing) for structured data that does not need a custom extractor. This context information is applied to all tests in this folder.
- <testname>.[txt|html|pdf|pkpass|ics|eml|mbox]: The input test data.
- <testname.extension>.json: The expected JSON-LD output. If this file doesn't exist, it is created by the test program.
- <testname.extension>.skip: If this file is present, the corresponding test is skipped.
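To make the naming scheme above concrete, here is a minimal sketch of how the files of one test folder could be paired up. It is not the code in extractortest.cpp: TestCase, endsWith and collectTests are hypothetical names, and the function works on a plain list of file names (which a caller might obtain from a directory listing) rather than on the file system.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct TestCase {
    std::string input;        // e.g. "ticket.pdf"
    std::string expected;     // always "<input>.json"
    bool hasExpected = false; // false: the reference output still has to be created
    bool skipped = false;     // true if "<input>.skip" is present
};

static bool endsWith(const std::string &s, const std::string &suffix)
{
    // Strictly longer than the suffix, so a bare ".pdf" has no test name and is ignored.
    return s.size() > suffix.size()
        && s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Pairs each input file of one folder with its .json and .skip companions.
std::vector<TestCase> collectTests(const std::vector<std::string> &fileNames)
{
    static const std::vector<std::string> inputExtensions{
        ".txt", ".html", ".pdf", ".pkpass", ".ics", ".eml", ".mbox"};
    const std::set<std::string> present(fileNames.begin(), fileNames.end());

    std::vector<TestCase> tests;
    for (const auto &name : present) { // std::set iterates in sorted order
        if (name == "context.eml") {
            continue; // shared context, not a test input
        }
        const bool isInput = std::any_of(
            inputExtensions.begin(), inputExtensions.end(),
            [&name](const std::string &ext) { return endsWith(name, ext); });
        if (!isInput) {
            continue; // also filters out the .json and .skip companions
        }
        TestCase tc;
        tc.input = name;
        tc.expected = name + ".json";
        tc.hasExpected = present.count(tc.expected) > 0;
        tc.skipped = present.count(name + ".skip") > 0;
        tests.push_back(tc);
    }
    return tests;
}
```

For a folder holding context.eml, a.txt, a.txt.skip, b.pdf and b.pdf.json, this sketch yields two test cases: a.txt (skipped, no reference output yet) and b.pdf (with reference output).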
Member Function Documentation
|void ExtractorEngine::setContent||(||KMime::Content *||content||)|
|void ExtractorEngine::setContext||(||KMime::Content *||context||)|
Sets the MIME part that the document to be extracted comes from.
Use this for documents received by email, to provide additional hints for the extraction. Calling this method is not necessary when using setContent, only when using any of the other content setter methods directly.
|void ExtractorEngine::setContextDate||(||const QDateTime &||dt||)|
Sets the date on which the document being extracted was issued.
This does not need to be perfectly accurate and is used to complete incomplete date information in the document (typically a missing year). This method does not need to be called when setContext is used.
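As an illustration of why an approximate context date is enough, the sketch below completes a date that lacks a year. The rule it uses is an assumption made for this example, not necessarily what the library does: the date is taken to be the first occurrence of that month and day on or after the context date. Date and completeYear are hypothetical names, and leap days are not validated.

```cpp
#include <cassert>

// Hypothetical minimal date type; the real API uses QDateTime.
struct Date {
    int year;
    int month; // 1..12
    int day;   // 1..31
};

// ASSUMPTION: a document issued at `context` refers to the first occurrence
// of month/day on or after the context date (e.g. a ticket booked in
// December for a trip in January of the following year).
Date completeYear(int month, int day, const Date &context)
{
    assert(month >= 1 && month <= 12 && day >= 1 && day <= 31);
    const bool beforeContext =
        month < context.month || (month == context.month && day < context.day);
    return Date{beforeContext ? context.year + 1 : context.year, month, day};
}
```

With a context date of 20 December 2019, "5 January" resolves to 2020 under this rule, while with a context date of 1 March 2019, "10 June" stays in 2019. A context date that is off by a few days only matters for documents dated very close to it.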
|void ExtractorEngine::setExtractors||(||std::vector< const Extractor * > &&||extractors||)|
|void ExtractorEngine::setHtmlDocument||(||HtmlDocument *||htmlDoc||)|
|void ExtractorEngine::setPass||(||KPkPass::Pass *||pass||)|
|void ExtractorEngine::setPdfDocument||(||PdfDocument *||pdfDoc||)|
|void ExtractorEngine::setText||(||const QString &||text||)|