KSpeech

A D-Bus Interface to Text-to-Speech Service

Version: 2.0 Draft 1

Introduction to the KSpeech D-Bus Interface

KSpeech is a D-Bus interface for applications desiring to speak text. Applications may speak text by sending D-Bus messages to application "org.kde.kttsd", object path "/KSpeech", interface "org.kde.KSpeech".

KTTSD – the KDE Text-to-Speech Daemon – is the program that supplies the services in the KDE Text-to-Speech API.

Warning: The KSpeech interface is still being developed and is likely to change in the future.

Features

Priority system for Screen Readers, warnings and messages, while still playing regular texts.
Long text is parsed into sentences. The user may back up by sentence, replay, pause, and stop playback.
Handles multiple speaking applications. Speech requests are treated like print jobs. Jobs may be created, stopped, paused, resumed, and deleted.
Speak contents of clipboard.
Speak contents of a file.
Speak KDE notifications.
Plugin-based job filtering permits substitution for misspoken words, abbreviations, etc., transformation of XML or XHTML to SSML, and automatic choice of appropriate synthesis engine.

Requirements

You may build any KDE application to use KSpeech, since the interface is in kdelibs, but the kdeaccessibility package must be installed for KTTS to function.

You will need a speech synthesis engine, such as Festival. See the KTTS Handbook for the latest information on installing and configuring speech engines and voices with KTTS.

Design Goals

The KDE Text-to-Speech API is designed with the following goals:

Support the features enumerated above.
Plugin-based architecture for support of a wide variety of speech synthesis engines and drivers.
Permit generation of speech from the command line (or via shell scripts) using the KDE D-Bus utilities.
Provide a lightweight and easily usable interface for applications to generate speech output.
Applications need not be concerned about contention over the speech device.
Provide limited support for speech markup languages, such as Speech Synthesis Markup Language (SSML).
Provide limited support for embedded speech markers.
Asynchronous to prevent system blocking.
Plugin-based audio architecture. Currently supports ALSA or Phonon.

Architecturally, applications interface with KTTSD, which performs queuing, speech job management, plugin management and sentence parsing. KTTSD interfaces with one or more KTTSD speech plugins, each of which in turn interfaces with a speech engine or driver.

         application
              ^
              |  via D-Bus (the KDE Text-to-Speech API)
              v
            kttsd
              ^
              |  KTTSD plugin API
              v
         kttsd plugin
              ^
              |
              v
        speech engine

The KTTSD Plugin API is documented in PluginConf in the kdeaccessibility module.

There is a separate GUI application, called kttsmgr, for providing KTTSD configuration and job management.

Speech Jobs and Priorities

When a request for speech is made, usually via the OrgKdeKSpeechInterface::say method, a speech job is queued. The order in which jobs are spoken is determined by their priority (listed here in decreasing priority):

Screen Reader Output
Warnings
Messages
Text Jobs

Screen Reader output pre-empts any other speech in progress, including other Screen Reader outputs, i.e., it is not a queue. This is reserved for use by Screen Readers.

Warnings take priority over Messages, which take priority over text jobs. Warnings and Messages are spoken when the currently-speaking sentence of a text job is finished.

Text Jobs have the lowest priority and are used for long text or general TTS.
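The ordering rules above can be illustrated with a small standalone sketch. It is not KTTSD's implementation: the Priority enumerators, the Job struct and the JobQueue class are hypothetical names chosen for this example, and the sketch models only which job is selected next (highest priority first, oldest first within a priority). It does not model interruption of a sentence that is already being spoken.

```cpp
#include <deque>
#include <map>
#include <string>

// Priorities in decreasing order of precedence, as listed above.
// These enumerator names are illustrative, not the KSpeech enum.
enum class Priority { ScreenReader = 0, Warning = 1, Message = 2, Text = 3 };

struct Job {
    int number;
    Priority priority;
    std::string text;
};

class JobQueue {
public:
    // Queue a job and return its job number.
    int say(Priority p, const std::string &text)
    {
        Job job{++m_lastJob, p, text};
        if (p == Priority::ScreenReader) {
            // Screen Reader output is not a queue: a new request
            // replaces any pending Screen Reader output.
            m_queues[p].clear();
        }
        m_queues[p].push_back(job);
        return job.number;
    }

    // Select the next job to speak. Returns false when nothing is queued.
    bool next(Job &out)
    {
        // std::map iterates in key order, so higher priorities come first.
        for (auto &entry : m_queues) {
            std::deque<Job> &queue = entry.second;
            if (!queue.empty()) {
                out = queue.front();
                queue.pop_front();
                return true;
            }
        }
        return false;
    }

private:
    int m_lastJob = 0;
    std::map<Priority, std::deque<Job>> m_queues;
};
```

With this sketch, a Warning queued after a Text job and a Message is still selected before both of them.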

The priority of jobs is determined by the OrgKdeKSpeechInterface::setDefaultPriority method. After setting the priority, all subsequent say commands are queued at that priority.

Within a job, the application (and the user, via the kttsmgr GUI) may back up or advance by sentence, or rewind to the beginning.

See also: OrgKdeKSpeechInterface::moveRelSentence.

All jobs may be paused, resumed or deleted (stopped) from the queue.

See also: OrgKdeKSpeechInterface::pause, OrgKdeKSpeechInterface::resume, OrgKdeKSpeechInterface::removeJob, and OrgKdeKSpeechInterface::removeAllJobs.

D-Bus Command-line Interface

Examples of using the KSpeech interface via command-line D-Bus follow.

To create a text job to be spoken

     qdbus org.kde.kttsd "/KSpeech" say <text> <options>

where <text> is the text to be spoken, and <options> is one of the options defined in the KSpeech::SayOptions enum. Normally, this can be entered as zero.

Example.

     qdbus org.kde.kttsd "/KSpeech" say "Hello World." 0

To stop speaking and delete the last queued job.

     qdbus org.kde.kttsd "/KSpeech" removeJob 0

Depending upon the speech plugin used, speaking may not immediately stop. The zero argument in this case is the job number to be removed. Zero means "the current job".

Calling KTTSD from a Program

There are two methods of making D-Bus calls from your application to KTTSD.

Manually code them using a QDBusInterface object. See kdebase/konqueror/kttsplugin/khtmlkttsd.cpp for an example. This method is recommended if you want to make a few simple calls to KTTSD.
Use OrgKdeKSpeechInterface as described below. This method generates the marshalling code for you and is recommended for more complex speech-enabled applications. kcmkttsmgr in the kdeaccessibility module is an example that uses this method.

Manual code

Sending a text job to KSpeech is very simple. Sample code:

       #include <QtDBus/QtDBus>
       
       QDBusInterface kspeech("org.kde.kttsd", "/KSpeech", "org.kde.KSpeech");
       // Send a string to KTTS and get back a job number.
       kspeech.call("setApplicationName", "MyApp");
       QDBusReply<int> reply = kspeech.call("say", "Hello World.", 0);

Notice the call to OrgKdeKSpeechInterface::setApplicationName. All applications should do this before submitting any jobs so that a friendly name will appear in kttsmgr.

Here's a slightly more complicated sample that sets the job priority to Message and specifies a language code ("en" or "de", for example):

     #include <QtDBus/QtDBus>
     #include <kspeech.h>

     bool kttsdSay(const QString &text, const QString &language)
     {
         // TODO: Would be better to save off this QDBusInterface and
         // set applicationName and defaults only once.
         QDBusInterface kspeech("org.kde.kttsd", "/KSpeech", "org.kde.KSpeech");
         kspeech.call("setApplicationName", "KMouth");
         kspeech.call("setDefaultTalker", language);
         kspeech.call("setDefaultPriority", static_cast<int>(KSpeech::jpMessage));
         QDBusReply<int> reply = kspeech.call("say", text, 0);
         // Treat an invalid reply or a zero job number as failure.
         return reply.isValid() && reply.value() != 0;
     }

It is not necessary to call setDefaultTalker and setDefaultPriority prior to each call to say. These settings remain in effect for all subsequent calls to say.

Using OrgKdeKSpeechInterface

Begin by adding the following command to your CMakeLists.txt file so that the build system will generate kspeechinterface.h and kspeechinterface.cpp for you from the org.kde.KSpeech.xml interface definition file using the qdbusxml2cpp utility.

     qt4_add_dbus_interfaces(myapp_SRCS org.kde.KSpeech.xml )

Substitute your application's SRCS target for "myapp".

TODO: At present, to make the command above work, you must copy org.kde.KSpeech.xml from kdelibs/interfaces/kspeech to your source directory, but this will change in the future.

In your application's .h file, add the following code to declare a variable to hold an instance of the OrgKdeKSpeechInterface class. You can also declare slots to receive signals from KTTSD. Typically, you will do this as part of a class.

    #include <QtCore/QObject>
    #include "kspeechinterface.h"

    class MyClass : public QObject {
    
    Q_OBJECT
    
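    public:
        MyClass(QObject *parent=0);
        ~MyClass();

        // Returns true if KTTSD is running; creates m_kspeech on first use.
        bool isKttsdRunning();

    protected Q_SLOTS:
        Q_SCRIPTABLE void jobStateChanged(const QString &appId, int jobNum, int state);
        // Connected to the D-Bus serviceUnregistered signal.
        void slotServiceUnregistered(const QString &serviceName);

    private:
        org::kde::KSpeech* m_kspeech;
    };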

In the .cpp file, determine if KTTSD is running, create the OrgKdeKSpeechInterface object, and connect signals to slots like this:

    MyClass::MyClass(QObject *parent) : QObject(parent), m_kspeech(0) { }
    
    bool MyClass::isKttsdRunning()
    {
        bool isRunning = (QDBus::sessionBus().interface()->isServiceRegistered("org.kde.kttsd"));
        if (isRunning) {
            if (!m_kspeech) {
                m_kspeech = new OrgKdeKSpeechInterface("org.kde.kttsd", "/KSpeech", QDBus::sessionBus());
                m_kspeech->setParent(this);
                m_kspeech->setApplicationName("MyApp");
                connect(m_kspeech, SIGNAL(jobStateChanged(const QString&, int, int)),
                     this, SLOT(jobStateChanged(const QString&, int, int)));
            }
        } else {
            delete m_kspeech;
            m_kspeech = 0;
        }
        return isRunning;
    }

Notice that the application sets a friendly display name for itself. If this is not done, the D-Bus connection name (example: ":1.16") will be shown in kttsmgr.

To submit a simple job of priority Text using the default talker:

    // Declare jobNum outside the if so it remains usable afterwards.
    int jobNum = 0;
    if (m_kspeech)
        jobNum = m_kspeech->say("Hello World", 0);

The second argument to "say" is used to give hints to KTTSD about the contents of the text. See KSpeech::SayOptions.

Talkers determine the synthesizer and language that will be used for TTS. To change the talker to a German-speaking one:

    if (m_kspeech)
        m_kspeech->setDefaultTalker("de");

All subsequent calls to say will use this talker. See OrgKdeKSpeechInterface::getTalkerCodes.

If you want to detect if KTTSD is installed without starting it, use this code.

     KTrader::OfferList offers = KTrader::self()->query("DBUS/Text-to-Speech", "Name == 'KTTSD'");
     if (offers.count() > 0)
     {
       // KTTSD is installed.
     }

Typically, you would do this to hide a menu item or button if KTTSD is not installed.

If KTTSD is not running, you can start it:

TODO: Use D-Bus start service or KTrader?

To detect if KTTSD has exited, you can use the OrgKdeKSpeechInterface::kttsdExiting signal, or you can connect to the D-Bus serviceUnregistered signal, like this:

     connect (QDBus::sessionBus().interface(), SIGNAL(serviceUnregistered(const QString&)),
        this, SLOT(slotServiceUnregistered(const QString&)));

    void MyClass::slotServiceUnregistered(const QString& serviceName)
    {
        if (serviceName == "org.kde.kttsd") {
            delete m_kspeech;
            m_kspeech = 0;
        }
    }

Signals Emitted by KTTSD

KTTSD emits a number of D-Bus signals, which provide information about sentences spoken, jobs started, paused, interrupted, finished, or deleted and markers seen. In general, these signals are broadcast to any application that connects to them. Applications should check the appId argument to determine whether the signal belongs to them or not.

    void MyClass::jobStateChanged(const QString &appId, int jobNum, int state)
    {
        // Ignore signals that belong to other applications.
        if (appId != QDBus::sessionBus().baseService()) return;
        if (KSpeech::jsFinished == state) {
            // jobNum has finished speaking.
        }
    }

See also: OrgKdeKSpeechInterface::jobStateChanged; OrgKdeKSpeechInterface::marker; OrgKdeKSpeechInterface::kttsdExiting

Talkers, Talker Codes, and Plugins

Many of the methods permit you to specify a desired "talker". This may be a simple language code, such as "en" for English, "es" for Spanish, etc. Pass an empty string ("") to use the default configured talker.

Within KttsMgr, the user has the ability to configure more than one talker for each language, with different voices, genders, volumes, and talking speeds.

Talker codes serve two functions:

They identify configured plugins, and
They provide a way for applications to specify the desired speaking attributes that influence the choice of plugin to speak text.

A Talker Code consists of a series of XML tags and attributes. An example of a full Talker Code with all attributes specified is

     <voice lang="en" name="kal" gender="male"/>
     <prosody volume="soft" rate="fast"/>
     <kttsd synthesizer="Festival" />

(The voice and prosody tags are adapted from the W3C Speech Synthesis Markup Language (SSML) and Java Speech Markup Language (JSML). The kttsd tag is an extension to the SSML and JSML languages to support named synthesizers and text encodings.) KTTS does not require the voice, prosody, and kttsd tags themselves. They may be omitted and just the attributes specified. The example above then becomes

lang="en" name="kal" gender="male" volume="soft" rate="fast" synthesizer="Festival"

The attributes may be specified in any order.
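As an illustration of why the tags and the attribute order do not matter, here is a minimal sketch that extracts the name="value" pairs from a talker code. The function name parseTalkerCode is hypothetical and the regular expression is an assumption about the syntax shown above; KTTSD's own parser may differ.

```cpp
#include <map>
#include <regex>
#include <string>

// Extract name="value" pairs from a talker code. Enclosing tags such as
// <voice ...>, <prosody ...> and <kttsd ...> are simply skipped, so the
// tagged and untagged forms produce the same result.
std::map<std::string, std::string> parseTalkerCode(const std::string &code)
{
    static const std::regex attr("([A-Za-z]+)\\s*=\\s*\"([^\"]*)\"");
    std::map<std::string, std::string> attrs;
    for (std::sregex_iterator it(code.begin(), code.end(), attr), end; it != end; ++it) {
        attrs[(*it)[1].str()] = (*it)[2].str();
    }
    return attrs;
}
```

Because the result is a map keyed by attribute name, two talker codes that list the same attributes in a different order compare equal.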

For clarity, the rest of the discussion will omit the voice, prosody, and kttsd tags.

The attributes that make up a talker code are:

lang. Language code and optional country code. Examples: en, es, en_US, en_GB. Codes are case-insensitive, and a hyphen (-) or underscore (_) may be used to separate the country code from the language code.
synthesizer. The name of the synthesizer (plugin) used to produce the speech.
gender. May be either "male", "female", or "neutral".
name. The name of the voice code. The choice of voice codes is synthesizer-specific.
volume. May be "loud", "medium", or "quiet". A synonym for "quiet" is "soft".
rate. May be "fast", "medium", or "slow".

Each plugin, once it has been configured by a user in kttsmgr, returns a fully-specified talker code to identify itself. If the plugin supports it, the user may configure another instance of the plugin with a different set of attributes. This is the difference between a "plugin" and a "talker". A talker is a configured instance of a plugin. Each plugin (if it supports it) may be configured as multiple talkers.

When the user configures KTTSD, she configures one or more talkers and then places them in preferred order, top to bottom in kttsmgr. In effect, she specifies her preferences for each of the talkers.

When applications specify a talker code, they need not (and typically do not) give a full specification. An example of a talker code with only some of the attributes specified might be

lang="en" gender="female"

If the talker code is not in XML attribute format, it is assumed to be a lang attribute. So the talker code

en

is interpreted as

lang="en"
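A sketch of this shorthand rule follows. The function name normalizeTalkerCode is hypothetical, and testing for an equals sign is only an assumed way of deciding whether a code is "in XML attribute format"; the actual check in KTTSD is not described here.

```cpp
#include <string>

// Return the talker code in attribute form. A code that contains no '='
// is taken to be a bare language code. An empty code (meaning "use the
// default talker") is returned unchanged.
std::string normalizeTalkerCode(const std::string &code)
{
    if (code.empty() || code.find('=') != std::string::npos)
        return code;
    return "lang=\"" + code + "\"";
}
```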

When a program requests a talker code in calls to OrgKdeKSpeechInterface::setDefaultTalker, KTTSD tries to match the requested talker code to the closest matching configured talker.

The lang attribute always has "priority": attempting to speak English with a Spanish synthesizer would likely be unintelligible. If an application does not specify a language attribute, a default one will be assumed. The rest of the attributes are said to be "preferred". If KTTSD cannot find a talker with the exact preferred attributes requested, the closest matching talker will likely still be understandable.

An application may specify that one or more of the attributes it gives in a talker code have priority by preceding each priority attribute with an asterisk. For example, the following talker code

lang="en" gender="*female" volume="soft"

means that the application wants to use a talker that supports the English language and Female gender. If there is more than one such talker, one that supports Soft volume would be preferred. Notice that a talker configured as English, Male, and Soft volume would not be picked as long as an English Female talker is available.

The algorithm used by KTTSD to find a matching talker is as follows:

If language code is not specified by the application, assume default configured by user. The primary language code automatically has priority.
(Note: This is not yet implemented.) If there are no talkers configured in the language, KTTSD will attempt to automatically configure one (see the automatic configuration discussion below).
The talker that matches on the most priority attributes wins.
If a tie, the one that matches on the most preferred attributes wins.
If there is still a tie, the one nearest the top of the kttsmgr display (first configured) will be chosen.
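The scoring steps above (most priority matches, then most preferred matches, then first configured) can be sketched as follows. This is an illustration under stated assumptions, not KTTSD's code: the name matchTalker and the Attributes alias are hypothetical, talker codes are assumed to be already parsed into attribute maps, and the default-language and automatic-configuration steps and the country-code handling are left out.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;

// Return the index of the best-matching configured talker, or -1 if no
// talkers are configured. In 'requested', a leading '*' on a value marks
// a priority attribute; "lang" always counts as priority.
int matchTalker(const Attributes &requested, const std::vector<Attributes> &configured)
{
    int best = -1, bestPriority = -1, bestPreferred = -1;
    for (std::size_t i = 0; i < configured.size(); ++i) {
        int priority = 0, preferred = 0;
        for (const auto &req : requested) {
            std::string value = req.second;
            bool isPriority = (req.first == "lang");
            if (!value.empty() && value[0] == '*') {
                isPriority = true;
                value.erase(0, 1);
            }
            const auto it = configured[i].find(req.first);
            if (it != configured[i].end() && it->second == value) {
                if (isPriority)
                    ++priority;
                else
                    ++preferred;
            }
        }
        // Strict '>' keeps the earliest configured talker on a full tie.
        if (priority > bestPriority ||
            (priority == bestPriority && preferred > bestPreferred)) {
            best = static_cast<int>(i);
            bestPriority = priority;
            bestPreferred = preferred;
        }
    }
    return best;
}
```

Note that the sketch returns a talker whenever at least one is configured, even if nothing matches, which mirrors the rule that something heard is better than nothing at all.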

Language codes actually consist of two parts, a language code and an optional country code. For example, en_GB is English (United Kingdom). The language code is treated as a priority attribute, but the country code (if specified) is treated as preferred. So for example, if an application requests the following talker code

lang="en_GB" gender="male" volume="medium"

then a talker configured as lang="en" gender="male" volume="medium" would be picked over one configured as lang="en_GB" gender="female" volume="soft", since the former matches on two preferred attributes and the latter only on the preferred attribute GB. An application can override this and make the country code priority with an asterisk. For example,

lang="*en_GB" gender="male" volume="medium"

To specify that American English is priority, put an asterisk in front of en_US, like this.

lang="*en_US" gender="male" volume="medium"

Here the application is indicating that a talker that speaks American English has priority over one that speaks a different form of English.
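To make the language and country handling concrete, here is a small sketch that splits a lang value into its parts. The LangSpec struct and parseLang function are hypothetical names, and normalizing the language to lower case and the country to upper case is just one assumed way of honoring the case-insensitivity described earlier.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

struct LangSpec {
    std::string language;    // lower-cased, e.g. "en"; always a priority attribute
    std::string country;     // upper-cased, e.g. "GB"; empty if absent
    bool countryIsPriority;  // true when the value began with '*'
};

LangSpec parseLang(std::string value)
{
    LangSpec spec;
    spec.countryIsPriority = !value.empty() && value[0] == '*';
    if (spec.countryIsPriority)
        value.erase(0, 1);
    // Either a hyphen or an underscore may separate language and country.
    const std::string::size_type sep = value.find_first_of("-_");
    spec.language = value.substr(0, sep);
    if (sep != std::string::npos)
        spec.country = value.substr(sep + 1);
    std::transform(spec.language.begin(), spec.language.end(), spec.language.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    std::transform(spec.country.begin(), spec.country.end(), spec.country.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return spec;
}
```

A matcher built on this would score the language as a priority attribute and the country as preferred unless countryIsPriority is set.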

(Note: Not yet implemented). If a language code is specified, and no plugin is currently configured with a matching language code, KTTSD will attempt to automatically load and configure a plugin to support the requested language. If there is no such plugin, or there is a plugin but it cannot automatically configure itself, KTTSD will pick one of the configured plugins using the algorithm given above.

Notice that KTTSD will always pick a talker, even if it is a terrible match. (The principle is that something heard is better than nothing at all. If it sounds terrible, user will change his configuration.) If an attribute is absolutely mandatory – in other words the application must speak with the attribute or not at all – the application can determine if there are any talkers configured with the attribute by calling OrgKdeKSpeechInterface::getTalkerCodes, and if there are none, display an error message to the user.

Applications can implement their own talker-matching algorithm by calling getTalkerCodes, then finding the desired talker from the returned list. When the full talker code is passed in, KTTSD will find an exact match and use the specified talker.

If an application requires a configuration that the user has not created, it should display a message instructing the user to run kttsmgr and configure the desired talker. (This must be done interactively because plugins often need user assistance locating voice files, etc.)

The above scheme is designed to balance the needs of applications against user preferences. Applications are given the control they might need, without unnecessarily burdening the application author. If you are an application author, the above discussion might seem overly complicated. It isn't really all that complicated. Here are rules of thumb:

It is legitimate to not call OrgKdeKSpeechInterface::setDefaultTalker, in which case, the user's default talker will be used.
If you know the language code, give that in the talker code, otherwise leave it out.
If there is an attribute your application requires for proper functioning, specify that with an asterisk in front of it. For example, your app might speak in two different voices, Male and Female. Since your app requires both genders, call getTalkerCodes to determine if both genders are available and, if not, advise the user to configure them. Better yet, give the user a choice of available distinguishing attributes (loud/soft, fast/slow, etc.).
If there are other attributes you would prefer, specify those without an asterisk, but leave them out if it doesn't really make any difference to proper functioning of your application. Let the user decide them when they configure KTTS.

One final note about talkers. KTTSD does talker matching for each sentence spoken, just before the sentence is sent to a plugin for synthesis. Therefore, the user can change the effective talker in mid processing of a text job by changing his preferences, or even deleting or adding new talkers to the configuration.

Speech Markup

Note: Speech Markup is not yet fully implemented in KTTSD.

The text passed in a call to OrgKdeKSpeechInterface::say may contain speech markup, provided that the plugin the user has configured supports that markup. The markup languages and plugins currently supported are:

Speech Synthesis Markup language (SSML): Festival and Hadifix.

This may change in the future as synthesizers improve.

Before including markup in the text sent to kttsd, the application should query whether the currently-configured plugin supports the markup language by calling OrgKdeKSpeechInterface::getTalkerCapabilities1.

If it does not support the markup, the markup will be stripped out of the text.

Support for Markers

Note: Markers are not yet fully implemented in KTTSD. At present, only mtSentenceBegin and mtSentenceEnd are emitted.

When using a speech markup language, such as SSML, the application may embed named markers into the text. If the user's chosen speech plugin supports markers, KTTSD will emit OrgKdeKSpeechInterface::marker signals when the speech engine encounters the marker. Depending upon the speech engine and plugin, this may occur either when the speech engine encounters the marker during synthesis from text to speech, or when the speech is actually spoken on the audio device. The calling application can call the OrgKdeKSpeechInterface::getTalkerCapabilities1 method to determine if the currently configured plugin supports markers or not.

Sentence Parsing

Not all speech engines provide robust capabilities for stopping synthesis that is in progress. To compensate for this, KTTSD parses jobs given to it by the OrgKdeKSpeechInterface::say method into sentences and sends the sentences to the speech plugin one at a time. In this way, should the user wish to stop the speech output, they can do so, and the worst that will happen is that the last sentence will be completed. This is called Sentence Boundary Detection (SBD).

Sentence Boundary Detection also permits the user to rewind by sentences.

The default sentence delimiter used for plain text is as follows:

A period (.), question mark (?), exclamation mark (!), colon (:), or semi-colon (;) followed by whitespace (including newline), or
Two newlines in a row separated by optional whitespace, or
The end of the text.
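The three default rules can be sketched as a simple scanner. This is an illustration only: the function name splitSentences is hypothetical, trimming whitespace and dropping empty pieces are choices made for this example, and KTTSD's actual SBD filter may behave differently.

```cpp
#include <cctype>
#include <string>
#include <vector>

static bool isSpace(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Append 'piece' to 'out' with surrounding whitespace trimmed; skip if empty.
static void pushTrimmed(std::vector<std::string> &out, const std::string &piece)
{
    std::string::size_type b = 0, e = piece.size();
    while (b < e && isSpace(piece[b])) ++b;
    while (e > b && isSpace(piece[e - 1])) --e;
    if (e > b) out.push_back(piece.substr(b, e - b));
}

std::vector<std::string> splitSentences(const std::string &text)
{
    static const std::string terminators = ".?!:;";
    std::vector<std::string> sentences;
    std::string current;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        current += c;
        bool boundary = false;
        if (terminators.find(c) != std::string::npos) {
            // Rule 1: punctuation followed by whitespace.
            boundary = (i + 1 < text.size()) && isSpace(text[i + 1]);
        } else if (c == '\n') {
            // Rule 2: a second newline after optional whitespace.
            std::string::size_type j = i + 1;
            while (j < text.size() && text[j] != '\n' && isSpace(text[j])) ++j;
            boundary = (j < text.size()) && text[j] == '\n';
        }
        if (boundary) {
            pushTrimmed(sentences, current);
            current.clear();
        }
    }
    pushTrimmed(sentences, current);  // Rule 3: the end of the text.
    return sentences;
}
```

Because rule 1 requires whitespace after the punctuation, a number such as 3.14 is not split at its decimal point.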

When given text containing speech markup, KTTSD automatically determines the markup type and parses based on the sentence semantics of the markup language.

An application may change the sentence delimiter by calling OrgKdeKSpeechInterface::setSentenceDelimiter prior to calling OrgKdeKSpeechInterface::say. Changing the delimiter does not affect other applications.

Jobs of priority Screen Reader Output are not split into sentences. For this reason, applications should avoid sending long messages of priority KSpeech::jpScreenReaderOutput.

Sentence Boundary Detection is implemented as a plugin SBD filter. See filters for more information.

Filters

Users may specify filters in the kttsmgr GUI. Filters are plugins that modify the text to be spoken or change other characteristics of jobs. Currently, the following filter plugins are available:

String Replacer. Permits users to substitute for misspoken words, or vocalize chat emoticons.
XML Transformer. Given a particular XML or XHTML format, permits conversion of the XML to SSML (Speech Synthesis Markup Language) using XSLT (Extensible Stylesheet Language Transformations) stylesheets.
Talker Chooser. Permits users to redirect jobs from one configured Talker to another based on the contents of the job or application that sent it.
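To give a feel for what a text-modifying filter does, here is a minimal sketch in the spirit of the String Replacer. It is not the plugin's code: the function name replaceWords is hypothetical, and the words to replace are assumed to consist of letters and digits only, since they are inserted directly into a regular expression.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Apply whole-word substitutions to 'text', in the order given.
// Each rule pairs a word with the text to speak in its place.
std::string replaceWords(std::string text,
                         const std::vector<std::pair<std::string, std::string>> &rules)
{
    for (const auto &rule : rules) {
        // The \b anchors keep "KDE" from matching inside "KDEPIM".
        const std::regex word("\\b" + rule.first + "\\b");
        text = std::regex_replace(text, word, rule.second);
    }
    return text;
}
```

For example, a rule that maps "KDE" to "K D E" makes a synthesizer spell the name out instead of trying to pronounce it as a word.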

Additional plugins may be available in the future.

In addition to these regular filters, KTTS also implements Sentence Boundary Detection (SBD) as a plugin filter. See Sentence Parsing for more information.

Applications may control filtering by calling OrgKdeKSpeechInterface::setFilteringOn.

Note: SBD filters are never applied to Screen Reader jobs.

Author(s): Gary Cramblitt <garycramblitt@comcast.net>

Maintainer(s): Gary Cramblitt <garycramblitt@comcast.net>

License(s):: LGPLv2
