Speech Recognition
UT2004 uses the Microsoft Speech API (SAPI) for text-to-speech and speech recognition (32-bit Windows only).
This document will explain how to use the speech recognition feature.
Setup
To enable speech recognition, make sure the following is set in your UT2004.ini file:
[WinDrv.WindowsClient]
UseSpeechRecognition=True
This is the default setting.
Next, create a grammar file (the format is explained below). This file goes into the folder configured for speech files, by default ../Speech. The folder is set by SpeechPath in the [Core.System] section.
Now make sure your grammar file is loaded for your mod. For game types you do this by overriding the event SetGrammar() as follows:
event SetGrammar()
{
    LoadSRGrammar("MyGrammar"); // your grammar file, without extension
}
You can set this at any time in the game. Every subsequent call to LoadSRGrammar() replaces the previous grammar.
Important note: for the best performance you will have to train the speech recognition engine. This can take quite a long time, but more training gives better results. To train the engine, use the "Speech" applet in the Windows Control Panel.
Usage
Speech recognition is available at any time, even if there is no voice replication. Speech is recorded while the bVoiceTalk input is active. By default an alias, VoiceTalk, has been created for easy use, and the F key is bound to the VoiceTalk command.
For more foolproof usage, use the command console command; it executes the recognised string, not the raw string (unless you change the base code for that command).
Grammar file
This file defines how speech is recognised and which command sequences are built. The grammar file is a simple XML-like document.
Note: just like with any XML document, you can use <!-- ... --> for comments.
Only the most important features will be discussed. For more information about the grammar, download the SAPI SDK.
<GRAMMAR>
A grammar file always starts and ends with a <GRAMMAR>...</GRAMMAR> tag. A grammar needs at least one <RULE>...</RULE> element.
The only useful property of this tag is:
- LANGID
- (optional) This is the language ID for this grammar. At the moment the engine only supports US English (ID: 409)
Child elements:
- RULE (1 or more)
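As a minimal sketch of the structure described above, the smallest useful grammar is a single <GRAMMAR> element wrapping one active rule. The rule name HELLO and the phrase are made up for illustration; the RULE and P tags are described in the sections below.

```xml
<!-- Hypothetical minimal grammar: one active rule that matches "hello" -->
<GRAMMAR LANGID="409">
  <RULE NAME="HELLO" TOPLEVEL="ACTIVE">
    <P>hello</P>
  </RULE>
</GRAMMAR>
```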
<RULE>
This is the core element of the grammar. It defines a rule to be recognised. You can have more than one rule per grammar, but only one can be active.
Properties:
- NAME
- (required) the name of this rule. It is used to refer to this rule (see RULEREF)
- TOPLEVEL
- (optional) set this to "active" to make this rule the active rule. Only the active rule will be recognised as the start of a sequence.
Child elements:
- RULEREF
- PHRASE
- LIST
- OPT
- WILDCARD
- DICTATION
<LIST> or <L>
A list tag allows you to create a list of possible phrase paths. The elements within the list are mutually exclusive: exactly one of them is chosen.
For example in:
<list>
  <p>one</p>
  <p>two</p>
</list>
Either one or two will be accepted.
- PROPNAME
- (optional) this will set the default value of PROPNAME for the children (see PHRASE for a description)
Child elements:
- PHRASE
- LIST
- RULEREF
- WILDCARD
- DICTATION
<PHRASE> or <P>
This is the most important element. It defines the text that must be recognised in order to follow the rest of the path. This can be either a single word or several words. Which to use depends on the purpose.
- DISP
- (optional) text to display instead of the text within the tag
- MAX
- (optional) maximum occurrences of the phrase sequence, by default equal to MIN. "INF" means infinite.
- MIN
- (optional) minimum occurrences of the phrase, by default 1
- PRON
- (optional) pronunciation. Some words might not be accepted by the engine; this allows you to define a "sounds like" pronunciation for the text
- PROPNAME
- (optional) this is the token/text used to identify this recognised string. It is important for parsing the recognised string within the UnrealEngine.
Child elements:
- RULEREF
- PHRASE
- OPT
- LIST
- WILDCARD
- DICTATION
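The properties above can be combined on a single phrase. The following is only a sketch under assumptions: the rule name WEAPON, the spoken words, and the tokens FIRE and RELOAD are hypothetical and not taken from a UT2004 grammar.

```xml
<!-- Hypothetical rule: say "fire" one to three times, or say "reload now" -->
<GRAMMAR LANGID="409">
  <RULE NAME="WEAPON" TOPLEVEL="ACTIVE">
    <L>
      <!-- MIN/MAX: "fire" may be repeated up to three times in a row -->
      <P PROPNAME="FIRE" MIN="1" MAX="3">fire</P>
      <!-- several words recognised as one phrase, identified by RELOAD -->
      <P PROPNAME="RELOAD">reload now</P>
    </L>
  </RULE>
</GRAMMAR>
```

Giving each alternative its own PROPNAME means the script side only has to look at the token, not at the exact words that were spoken.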
<OPT> or <O>
This defines an optional element. The input is accepted whether or not the element is spoken.
For example
<p>hello</p> <opt>world</opt>
Either hello or hello world is accepted.
This tag has the same properties as PHRASE.
Child elements:
- RULEREF
- PHRASE
- OPT
- LIST
- WILDCARD
- DICTATION
<WILDCARD>
This is a wildcard for zero or more words. Everything that matches it is ignored in the output.
For example:
<p>bite my <wildcard /> metal ass</p>
This will match "bite my metal ass", "bite my shiny metal ass" or "bite my colossal shiny metal ass". However, the output will always be "bite my metal ass".
<DICTATION>
This is a very tricky tag; the result may not be what you want it to be. It's pretty much like a WILDCARD, except that the matching text won't be ignored.
Training the engine is very important for this tag, because it tries to guess the words you said. It can easily confuse the word "knife" with "life".
- MAX
- (optional) maximum words
- MIN
- (optional) minimum words
- PROPNAME
- (optional) used to represent this element in the recognized string.
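A sketch of how the tag might be used, assuming a fixed trigger word followed by free speech. The rule name SAY and the tokens SAY and TEXT are hypothetical; how well the dictated words are recognised depends on how well the engine has been trained.

```xml
<!-- Hypothetical rule: the fixed word "say" followed by free dictation -->
<GRAMMAR LANGID="409">
  <RULE NAME="SAY" TOPLEVEL="ACTIVE">
    <P PROPNAME="SAY">say</P>
    <!-- between one and five dictated words are accepted here -->
    <DICTATION PROPNAME="TEXT" MIN="1" MAX="5"/>
  </RULE>
</GRAMMAR>
```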
<RULEREF>
This tag will allow you to refer to a different RULE in the grammar. This way you can re-use rules in various phrases.
- NAME
- (required) the name of the rule you are referring to
- PROPNAME
- (optional)
Example Grammar
Let's say we have the following grammar (based on the BR.xml grammar of UT2004):
<GRAMMAR LANGID="409">
  <RULE NAME="BR" TOPLEVEL="ACTIVE">
    <P MIN="1" MAX="3">
      <RULEREF NAME="PLAYER"/>
    </P>
    <L>
      <P>
        <P PROPNAME="DEFEND">defend</P>
        <O>the</O>
        <O>ball</O>
      </P>
      <P>
        <L PROPNAME="ATTACK">
          <P>take</P>
          <P>attack</P>
          <P>get</P>
        </L>
        <O>the</O>
        <O>ball</O>
      </P>
      <P PROPNAME="COVER">cover me</P>
      <P PROPNAME="FREELANCE">freelance</P>
      <P PROPNAME="TAUNT">taunt</P>
    </L>
  </RULE>
  <RULE NAME="PLAYER">
    <L>
      <P PROPNAME="ALPHA">alpha</P>
      <P PROPNAME="BRAVO">bravo</P>
      <P PROPNAME="CHARLIE">charlie</P>
    </L>
    <O>and</O>
  </RULE>
</GRAMMAR>
This grammar requires you to address at least one bot first (alpha, bravo or charlie), with a max of 3 bots. The "and" after the bot's name is optional, so you can say "alpha bravo" or "alpha and bravo".
Next comes a command, which is defined as a list. It comes down to one of these commands: defend, attack, cover me, freelance or taunt.
For defend you could also say "defend the ball" or just "defend ball". For the attack command either "take", "attack" or "get" is accepted.
The propnames will generate a recognized string (see below). When you say "alpha take the ball" it will generate the string "ALPHA ATTACK".
UnrealScript
When a phrase you spoke is accepted by the engine, it calls the following event in the player controller:
event VoiceCommand( string RecognizedString, string RawString )
- RecognizedString
- contains the PROPNAME values as constructed via the grammar
- RawString
- contains the actual data received.
If you say "alpha take the ball" the RecognizedString will be "ALPHA ATTACK" and the RawString will be "alpha take the ball".
When you use a <WILDCARD /> that part will be ignored in the RawString. If whatever is said is important, use the <DICTATION> tag instead.
By default the RecognizedString is forwarded to GameInfo's function ParseVoiceCommand( PlayerController Sender, string RecognizedString ) for further processing. See the TeamGame class for an extensive example.
Quirks
Nesting elements will duplicate the PROPNAME in the RecognizedString.
For example:
<p PROPNAME="use">use <RULEREF NAME="ITEM" PROPNAME="item1" /> on <RULEREF NAME="ITEM" PROPNAME="item2" /></p>
Will generate:
use item1 use item2
To fix this, change it to something like:
<p><p PROPNAME="use">use</p><RULEREF NAME="ITEM" PROPNAME="item1" /> on <RULEREF NAME="ITEM" PROPNAME="item2" /></p>
This will generate:
use item1 item2
In case of "use" we could just ignore the "on" part since it's redundant at that point.
Discussion
El Muerte: I promised some cool stuff when I wrote this document. However when I was working on these cool things I stumbled upon a couple of issues that I could not resolve.
First, the <dictation /> part of the speech recognition works, but does not work well. Even after training the speech engine for more than an hour it still gave a lot of errors in the words I was trying. When using strict grammar rules (e.g. <phrase> tags) you won't have any problems with this, since the engine will try to match the spoken text with the limited set of possible phrases provided by the grammar (of course "knife" and "life" would still be difficult).
Secondly, I was trying to implement a dynamic grammar that would change depending on certain events (such as a conversation or the current in-game environment). One example would be to limit the strict grammar to the possible conversation choices (as in adventure games where you can select what to ask). Another would be to provide only a list of object names in the current room (so you could say "pick up <object in the room>"). The problem I encountered with this idea is that it's not possible to create a dynamic grammar file. For local grammars (stored on disk), only grammar files with an .xml extension are accepted. For online grammars you will have to use an independent webserver, because the game's webserver doesn't respond during the LoadGrammar function call (it's not multi-threaded). Since the best use for speech recognition would be a single-player game, using an external webserver just doesn't work well.
I might implement the second idea in the (near) future using an external webserver, or at least give it a try.