KItinerary
KItinerary provides a JavaScript engine for writing extractors that parse tickets and other travel documents and output structured data. Many projects later use this data to present useful information to the user.
All extractor scripts are written in JavaScript and stored in /src/lib/scripts.
How to make your own extractor script
It's highly recommended to use the KItinerary Workbench to develop and test your extractor scripts. For installation instructions and more information, see the KItinerary Workbench README.
Creating a new extractor script
To create a new extractor script, create new files in the $XDG_DATA_DIRS/kitinerary/extractors (~/.local/share/kitinerary/extractors) directory.
Note: For easier management and later collaboration in Git, we recommend symlinking the extractors directory to the scripts folder of a Git repository checkout (ln -s $(pwd)/src/lib/scripts ~/.local/share/kitinerary/extractors).
Script declaration
KItinerary uses a JSON file to declare the extractor scripts. This file sets the filtering rules that determine which extractor to run, and names the script itself.
Note: Multiple extractors can run on a single document. If more than one extractor outputs valid data, the results are merged into a single output.
Note: Multiple script declarations can exist for one JS extractor file. This is useful if there are many types of documents but the same script can extract data from all of them.
Extractor declaration
A declaration contains the MIME type of the document to be ingested, the filters defining when the extractor script should run, and the script file together with the function that will be called in it.
Extractor scripts are run against a document node if all of the following conditions are met:
- The mimeType of the script matches that of the node.
- At least one of the extractor filters of the script matches the node.
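The two conditions above can be sketched in plain JavaScript. The declaration object mirrors what would normally live in the JSON declaration file. The key names (mimeType, filter, script, function), the file name my-operator.js and the helper shouldRun are assumptions for illustration, not the engine's actual implementation, and the sketch only handles filters that apply to the node itself.

```javascript
// Hypothetical declaration, written as a JS object for illustration.
// In practice this content lives in a JSON file.
const declaration = {
    mimeType: "application/vnd.apple.pkpass",   // type of node the script processes
    filter: [
        {
            mimeType: "application/vnd.apple.pkpass",
            field: "passTypeIdentifier",          // assumed property name
            match: "^org\\.kde\\.travelAgency$",
            scope: "Current"
        }
    ],
    script: "my-operator.js",                     // assumed file name
    function: "main"                              // entry point inside the script
};

// Sketch of the run conditions; only supports filters with scope "Current".
function shouldRun(decl, node) {
    if (decl.mimeType !== node.mimeType) {
        return false;                             // condition 1: MIME types must match
    }
    // condition 2: at least one filter must match the node
    return decl.filter.some(function (f) {
        if (f.mimeType !== node.mimeType) {
            return false;
        }
        const value = f.field ? node.content[f.field] : node.content;
        return new RegExp(f.match).test(String(value));
    });
}
```

A node is modelled here as a plain object with mimeType and content, which is enough to try the logic outside the engine.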
Extractor filters
Extractor filters are evaluated against document nodes (the content of the document).
An extractor script filter consists of four properties: mimeType, field, match and scope.
scope defines which node the filter is matched against, relative to the node being processed. The following values are supported:
- Current: The filter is applied to the node itself (the filter's mimeType is that of the node).
- Parent: The filter is applied to the direct parent node of the current node (only one level back).
- Children: The filter is applied to the direct child nodes of the current node (only one level forward).
- Ancestors: The filter is applied to all parent nodes of the current node (all the way back).
- Descendants: The filter is applied to all child nodes of the current node (all the way forward).
Scope examples
Consider an email with a PDF attachment, with the ticket details inside the PDF. The PDF is a child of the email.
Node[0] is the PDF, but we match the ticket based on its parent, the email, and check whether it was sent from "booking@example-operator.com". As a result, the first argument of the parser function is the PDF, the second argument is the node (PDF), and the third argument is the document node that matched the filter (message/rfc822).
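How the five scopes select nodes can be sketched with a small tree of plain objects. This is an illustrative model, not the engine's code: the node shape (mimeType, content, parent, childNodes) and the helper nodesInScope are assumptions made for the sketch.

```javascript
// Returns the nodes a filter would be evaluated against for a given scope.
function nodesInScope(node, scope) {
    switch (scope) {
    case "Current":
        return [node];
    case "Parent":
        return node.parent ? [node.parent] : [];
    case "Children":
        return node.childNodes.slice();
    case "Ancestors": {
        const result = [];
        for (let n = node.parent; n; n = n.parent) {
            result.push(n);
        }
        return result;
    }
    case "Descendants": {
        const result = [];
        for (const child of node.childNodes) {
            result.push(child);
            result.push(...nodesInScope(child, "Descendants"));
        }
        return result;
    }
    default:
        return [];
    }
}

// The email -> PDF tree from the example above.
const email = {
    mimeType: "message/rfc822",
    content: { From: "booking@example-operator.com" },
    parent: null,
    childNodes: []
};
const pdf = { mimeType: "application/pdf", content: {}, parent: email, childNodes: [] };
email.childNodes.push(pdf);

// A Parent-scoped filter on the PDF node is checked against the email.
const matched = nodesInScope(pdf, "Parent").filter(function (n) {
    return n.mimeType === "message/rfc822"
        && /^booking@example-operator\.com$/.test(n.content.From);
});
```

Here matched holds the email node, which corresponds to the third argument passed to the script, while the PDF node remains the one being processed.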
Examples
Anything attached to an email sent by "booking@example-operator.com". The field matched against here is the From header of the MIME message.
Documents containing a barcode of the format "F12345678". Note that the scope here is Descendants rather than Children, as the direct child nodes tend to be the images containing the barcode.
Apple Wallet passes issued by "org.kde.travelAgency".
iCal events with an organizer email address of the "kde.org" domain. Note that the field here accesses a property of a property. This works at arbitrary depth, as long as the corresponding types are introspectable by Qt.
A (PDF) document containing an IATA boarding pass barcode of the airline "AB". Triggering vendor-specific UIC or ERA railway tickets can be done very similarly, matching on the corresponding carrier ids.
A node with existing results containing a reservation from "My Transport Operator". This is useful for scripts that want to augment or fix schema.org annotations already provided by the source. Note that the mimeType "application/ld+json" is special here, as it doesn't only trigger on the document node content itself but also matches against the results of nodes of any type.
NOT RECOMMENDED: The following should be used as a last resort only, as matching against the full PDF document content can be expensive.
PDF documents containing the string "My Ferry Booking" anywhere.
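Some of the filters described above can be written out as follows. They are shown as JavaScript objects so they can be tried with a small matcher. In a real declaration they would be JSON, and the property paths other than From (for example organizer.email) as well as the helper functions are assumptions for illustration.

```javascript
const exampleFilters = [
    // Anything attached to an email sent by the operator
    { mimeType: "message/rfc822", field: "From", match: "^booking@example-operator\\.com$", scope: "Ancestors" },
    // A barcode of the format "F12345678"; plain text content needs no field
    { mimeType: "text/plain", match: "^F\\d{8}$", scope: "Descendants" },
    // iCal organizer in the kde.org domain, using a nested property path
    { mimeType: "text/calendar", field: "organizer.email", match: "@kde\\.org$", scope: "Current" }
];

// Resolves a dotted field path such as "organizer.email" on a plain object.
function readField(content, field) {
    if (!field) {
        return content;
    }
    return field.split(".").reduce(function (obj, key) {
        return obj == null ? undefined : obj[key];
    }, content);
}

// Checks one filter's regular expression against one node's content.
function filterMatches(filter, node) {
    if (filter.mimeType !== node.mimeType) {
        return false;
    }
    const value = readField(node.content, filter.field);
    return value != null && new RegExp(filter.match).test(String(value));
}
```

Anchoring the expressions with ^ and $ where possible keeps a filter from triggering on unrelated documents that merely contain the same text.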
Extractor script
Extractor scripts run inside a QJSEngine. It isn't a full JS environment, and not everything is supported. Some additional APIs are available to extractor scripts (technical docs can be found under KItinerary::JsApi).
Objects of a document
ExtractorDocumentNode (node)
An object that represents a node in the document tree:
- content: Value of the node (e.g. text, barcode content, etc.)
- childNodes: List of child nodes of this node; they are also ExtractorDocumentNode objects.
- mimeType: MIME type of the node (e.g. text/plain, application/pdf, internal/qimage, etc.)
Examples
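A minimal sketch of walking the node tree, using only the childNodes, mimeType and content properties listed above. The helper name findFirstByMimeType is made up for this example.

```javascript
// Depth-first search for the first descendant node with a given MIME type.
function findFirstByMimeType(node, mimeType) {
    for (let i = 0; i < node.childNodes.length; ++i) {
        const child = node.childNodes[i];
        if (child.mimeType === mimeType) {
            return child;
        }
        const nested = findFirstByMimeType(child, mimeType);
        if (nested) {
            return nested;
        }
    }
    return null;   // nothing below this node has the requested type
}
```

Inside an extractor function this could be called as findFirstByMimeType(node, "text/plain") to reach, for instance, decoded barcode text below an image node.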
DocumentNode types
A ticket can come in different formats, and each format has its own object:
PDF - PDF document
PdfDocument is an object that represents a PDF document; it has the following properties:
- text: Extracts text from the PDF page. If used on the root node, it extracts all text from the PDF.
- pages: List of pages in the PDF.
- textInRect: Extracts text from a given rectangle on the PDF page. Uses normalized coordinates (0-1) in the format "Left, Top, Right, Bottom".
More: PdfDocument
Examples
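A short sketch of reading text from a PDF, assuming textInRect is called on a page with four separate numeric arguments in the order left, top, right, bottom. The function name and the "Booking ref:" label are hypothetical.

```javascript
// Reads a header from the top fifth of the first page, then scans the
// full page text for a booking reference.
function readPdfHeader(pdf) {
    const page = pdf.pages[0];
    // normalized coordinates: full width, top 20% of the page
    const header = page.textInRect(0.0, 0.0, 1.0, 0.2).trim();
    const ref = page.text.match(/Booking ref:\s*(\w+)/);
    return { header: header, reference: ref ? ref[1] : null };
}
```

Restricting extraction to a rectangle is handy when the same label appears in several places on a page and only one position is meaningful.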
Html - HTML document
HtmlDocument is an object that represents an HTML document consisting of HtmlElements; it has the following properties and methods:
- rawData(): Returns the raw textual HTML data.
- root(): Returns the root element of the document.
- eval(xpath): Evaluates an XPath expression relative to the document root and returns matching elements.
HtmlElement represents an element within an HTML document; it has the following properties and methods:
- name: Returns the element name (tag).
- isNull: Checks if the element is null/invalid.
- attribute: Returns the value of the specified attribute.
- hasAttribute: Checks whether an attribute with the given name exists.
- attributes: Returns a list of all attributes of this element.
- content: Returns the immediate text content of this element (trimmed of whitespace).
- recursiveContent: Returns the text content of this element and all its children.
- parent: Returns the parent element of this node.
- firstChild: Returns the first child element of this node.
- nextSibling: Returns the next sibling element of this node.
- eval: Evaluates an XPath expression relative to this element.
More: HtmlDocument, HtmlElement
Examples
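A sketch of collecting label/value pairs from an HTML table with eval, content, nextSibling and recursiveContent. The markup, the class name "label" and the function name are hypothetical, and nextSibling, isNull, content and recursiveContent are assumed to be readable as plain properties.

```javascript
// Collects pairs from rows such as
//   <tr><td class="label">From</td><td>Berlin</td></tr>
function readHtmlFields(html) {
    const fields = {};
    const labels = html.eval('//td[@class="label"]');
    for (let i = 0; i < labels.length; ++i) {
        const label = labels[i];
        const value = label.nextSibling;      // the cell right after the label
        if (value && !value.isNull) {
            fields[label.content] = value.recursiveContent;
        }
    }
    return fields;
}
```

Checking isNull before reading the sibling avoids picking up empty values when a label is the last cell in its row.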
PKPASS
An object holding the fields inside a PKPASS file:
- field[X]: Object with labels and values
Example - pkpass
function main(pass, node) {
    // pass.json has "boardingPass" with keys "depar" "arrir" "arrirTime" "deparTime" "code"
    var f = JsonLd.newFlightReservation(); // https://schema.org/FlightReservation
    f.reservationFor.departureAirport.name = pass.field["depar"].label;
    f.reservationFor.arrivalAirport.name = pass.field["arrir"].label;
    f.reservationFor.departureTime = JsonLd.toDateTime(pass.field["deparTime"].value, "hh:mm dd.MM.yyyy", "en");
    f.reservationFor.arrivalTime = JsonLd.toDateTime(pass.field["arrirTime"].value, "hh:mm dd.MM.yyyy", "en");
    f.reservationFor.airline.iataCode = "KD";
    f.reservationFor.flightNumber = pass.field["code"].label;
    return f; // Returns the flight reservation object later used by other apps
}
Additional API available to extractor scripts
JSON-LD API
API for supporting schema.org output:
- JsonLd: factory functions for schema.org objects, date/time parsing, etc.
More: JsonLd
Examples
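A small sketch of building a schema.org object with the JsonLd helpers. JsonLd is provided by the extractor engine. newTrainReservation() is assumed here by analogy with newFlightReservation(), so check the JsonLd reference before relying on it. The function name and the date format string are illustrative.

```javascript
// Builds a train reservation from already-parsed strings.
function makeTrainReservation(from, to, departure) {
    const res = JsonLd.newTrainReservation();   // assumed factory function
    res.reservationFor.departureStation.name = from;
    res.reservationFor.arrivalStation.name = to;
    // toDateTime(text, format, locale) parses a date string
    res.reservationFor.departureTime = JsonLd.toDateTime(departure, "dd.MM.yyyy hh:mm", "en");
    return res;
}
```

Starting from a factory object rather than an empty one means the nested reservationFor structure already exists and can be filled in directly.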
ByteArray, BitArray, Barcode
API for handling specific types of input data:
- ByteArray: functions for dealing with byte-aligned binary data, including decompression, Base64 decoding, Protocol Buffer decoding, etc.
- BitArray: functions for dealing with non-byte-aligned binary data, such as reading numerical data at arbitrary bit offsets. Often used if binary data uses a nonstandard encoding (e.g. 6 bits per character).
- Barcode: functions for manual barcode decoding. This should rarely be needed nowadays, with the extractor engine doing this automatically and creating corresponding document nodes.
Examples
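To illustrate what reading non-byte-aligned data involves, here is a plain-JavaScript sketch of decoding 6-bit characters. It deliberately does not use the BitArray API, and both helper functions and the alphabet (0 = space, 1-26 = A-Z) are made up for the example. In a real extractor the BitArray functions would take the place of readBitsMSB.

```javascript
// Reads bitCount bits starting at bitOffset, most significant bit first.
// bytes is an array of numbers in the range 0-255.
function readBitsMSB(bytes, bitOffset, bitCount) {
    let value = 0;
    for (let i = 0; i < bitCount; ++i) {
        const pos = bitOffset + i;
        const bit = (bytes[pos >> 3] >> (7 - (pos & 7))) & 1;
        value = value * 2 + bit;
    }
    return value;
}

// Decodes charCount characters of 6 bits each with the made-up alphabet.
function decodeSixBitText(bytes, charCount) {
    let text = "";
    for (let i = 0; i < charCount; ++i) {
        const code = readBitsMSB(bytes, i * 6, 6);
        text += code === 0 ? " " : String.fromCharCode(64 + code);
    }
    return text;
}
```

With this alphabet the three bytes 0x2C, 0x41, 0x40 hold the 6-bit codes 11, 4 and 5, so decodeSixBitText([0x2C, 0x41, 0x40], 3) yields "KDE".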
Extractor API
API for interacting with the extractor engine itself:
- ExtractorEngine: Allows performing extraction recursively. It can be useful for elements that need custom decoding in an extractor script first, but that otherwise contain generally supported data formats. Standard barcodes encoded in URL arguments are such an example.
More: ExtractorEngine
Examples
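A sketch of the URL case mentioned above: the script decodes the URL argument itself and hands the payload back to the engine. It assumes that ExtractorEngine offers an extract() call that accepts the raw data and returns a document node with a result list. That call, the "barcode" parameter name and the function name are assumptions to verify against the ExtractorEngine reference.

```javascript
// Pulls a barcode payload out of a URL query argument and lets the
// engine run its generic extraction on it.
function extractFromUrl(url) {
    const m = url.match(/[?&]barcode=([^&#]+)/);   // hypothetical parameter name
    if (!m) {
        return [];                                  // nothing to extract
    }
    const payload = decodeURIComponent(m[1]);
    const node = ExtractorEngine.extract(payload); // assumed API
    return node.result;
}
```

The script only does the custom decoding step, so any format the engine already understands does not have to be parsed again by hand.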
Extractor scripts
The script entry point is called with three arguments:
- The first argument is the content of the node that is processed. The data type of that argument depends on the node type as described in the document model section above. This is usually what extractor scripts are most concerned with.
- The second argument is the document node being processed (KItinerary::ExtractorDocumentNode, see the example below). It can be useful to access already extracted results on a node (e.g. coming from generic extraction) in order to augment those.
- The third argument is the document node that matched the filter. This can be the same as the second argument (for filters with scope = Current), but it doesn't have to be. It is most useful when triggering on descendant nodes such as barcodes, the content of which will then be incorporated into the extraction result by the script.
The output of your JS function should be one of:
- A JS object following the schema.org ontology (JsonLd) with a single extraction result.
- A JS array containing one or more schema.org/JsonLd objects. Useful if a ticket document has multiple tickets.
Script errors and an empty array are both treated as "[]" (that is, nothing was returned).
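The array form can be sketched as follows for a PDF with one leg per page. The "From:" and "To:" labels, the function name and the use of JsonLd.newTrainReservation() are assumptions for illustration.

```javascript
// Returns one reservation per page that carries a recognizable leg.
function extractAllLegs(pdf) {
    const results = [];
    for (let i = 0; i < pdf.pages.length; ++i) {
        const text = pdf.pages[i].text;
        const from = text.match(/From: (.*)/);
        const to = text.match(/To: (.*)/);
        if (!from || !to) {
            continue;                    // skip pages without a leg
        }
        const res = JsonLd.newTrainReservation();   // assumed factory function
        res.reservationFor.departureStation.name = from[1].trim();
        res.reservationFor.arrivalStation.name = to[1].trim();
        results.push(res);
    }
    return results;   // an empty array means nothing was extracted
}
```

Skipping pages explicitly keeps one unreadable page from turning into a script error that would discard the results from all other pages.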
Examples
Let's assume we want to create an extractor script for a railway ticket which comes with a simple tabular layout for a single leg per page, and contains a QR code with a 10-digit number for each leg.
As a filter we'd use something similar to example 2 above, triggering on the barcode content.
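A script for this ticket could look roughly like the sketch below. The text labels, the date format, the function name and the "qrCode:" token prefix are assumptions about the hypothetical ticket. barcode.location is assumed to give the page index of the barcode node, and JsonLd.newTrainReservation() is assumed to exist alongside newFlightReservation().

```javascript
// pdf: content of the PDF node, node: the PDF node,
// barcode: the descendant node that matched the filter
function extractTicket(pdf, node, barcode) {
    // read only the page the triggering barcode is on (assumed property)
    const text = pdf.pages[barcode.location].text;
    const res = JsonLd.newTrainReservation();
    res.reservationFor.departureStation.name = text.match(/From: (.*)/)[1];
    res.reservationFor.arrivalStation.name = text.match(/To: (.*)/)[1];
    res.reservationFor.departureTime = JsonLd.toDateTime(text.match(/Departure: (.*)/)[1], "dd.MM.yyyy hh:mm", "en");
    res.reservationFor.arrivalTime = JsonLd.toDateTime(text.match(/Arrival: (.*)/)[1], "dd.MM.yyyy hh:mm", "en");
    // carry the barcode content over into the result
    res.reservedTicket.ticketToken = "qrCode:" + barcode.content;
    return res;
}
```

If one of the labels is missing, the match returns null and the script throws, which is treated the same as returning nothing.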
The above example produces an entirely new result. Another common case is scripts that merely augment an existing result. Let's assume an Apple Wallet pass for a flight where the automatically extracted result is correct but misses the boarding group. The filter for this would be similar to example 3 above (Apple Wallet passes), triggering on the pass issuer.
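A sketch of such an augmenting script, using node.result and pass.field as in the pkpass example earlier. The field key "group" and the function name are hypothetical.

```javascript
// pass: content of the Apple Wallet pass, node: the pass node
function extractBoardingPass(pass, node) {
    const res = node.result[0];            // result from generic extraction
    // add only the missing piece, leave everything else untouched
    res.boardingGroup = pass.field["group"].value;
    return res;
}
```

Returning the modified existing result keeps all automatically extracted details and adds only the boarding group.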
Documentation copyright © 1996-2025 The KDE developers.