As easy as X+V

How to write an XHTML+Voice application in three, or so, easy steps

Level: Introductory

Jeff Kusnitz (jk@us.ibm.com), Senior Software Engineer, IBM

31 May 2005
Updated 08 Jul 2005

Everyone in the industry is hyping multimodal applications as the next cool way to build applications -- why limit users to a single input/output modality when they can use several at one time? X+V simplifies multimodal application development. This article takes you through the steps necessary to build a simple, but useful, multimodal application.

Archimedes once said "Gyros for dinner again?!" But he's probably better known for having said "Give me a lever and a place to stand and I can move the earth." In a recent conversation, a friend said "X+V is great. All I need is Notepad and the Opera browser (v8) and I can write any multimodal application."

Wow! Is he for real? Can you actually build a real application with just a text editor and a browser? Have I really been wasting my time using assorted toolkits and XML editors and debuggers? I decided to find out for myself.

The test application

First things first, though. I needed an idea for an application. This friend suggested sending SMS messages. Sheesh, first he thinks all it takes to write an application is a browser and a text editor, and now he's saying that T9 is a challenging text input method. OK, fine, sending an SMS message it is. How hard can it be, right?

Step 1: Figure out how to send an SMS message from a Web page. It turns out that most of the cell phone providers have Web pages that you can send SMS messages from; just fill in a couple of fields on a form, click "submit," and the message is sent (I can't wait to see my telephone bill this month!). Unfortunately, their HTML wasn't quite as readable as I had hoped, but with a bit of perseverance, I made my way through a couple of providers' Web pages and had my own hacked-up HTML that would send SMS messages through their servers. My application was only for my own education, so I'm not too concerned about [ab]using their servers, but if this were a real application, I'd certainly make sure they were aware of what I was doing. That would have saved the whole hacking effort, in fact.

HTML and XHTML were close enough that it didn't take a significant amount of effort to understand XHTML. I even got a little fancy in my code and used label tags rather than a table to identify input elements. Yeah, I know, big deal. It was a big deal for me. I also added a "style" section. I borrowed a bit of style code from another application, where it worked, and found that it didn't work in mine. After some manipulation, I found that Opera didn't like that I had enclosed my colors in quotation marks -- picky, picky. I grew up writing X Window code; we always quoted things. The final XHTML form looked like Listing 1 (just ignore the event handler information in the body tag for now):


Listing 1. Final XHTML form

<body id="documentBody" ev:event="load" ev:handler="#getMessageInfo">
  <form name="doIt" method="get">
    <label for="recipient"> Recipient: </label>
    <select name="recipient" id="recipient">
      <option value=""></option>
      <option value="Igor"> Igor Jablokov </option>
      <option value="Jeff"> Jeff Kusnitz </option>
    </select>
     <p/>
   
    <label for="message"> Message: </label>
    <select name="message" id="message">
      <option value=""></option>
      <option value="Yes">Yes</option>
      <option value="No">No</option>
      <option value="Call me">Call me</option>
      <option value="I need directions">Need directions</option>
      <option value="Where are you?">Where are you?</option>
      <option value="I'll call you later">Will call later</option>
      <option value="I'm busy">Busy</option>
     </select>
   
     <input type="button" value="Post Message" class="button"
            onclick="SendMessage()"/>
  </form>
</body>
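
The quoting problem mentioned above is easy to reproduce. The fragment below is a minimal sketch of a style section; the rules and color values are made up for illustration and are not taken from the original application (only the "button" class name comes from Listing 1). CSS color keywords are identifiers rather than strings, so a quoted color is an invalid declaration and a conforming browser ignores it.

```xml
<style type="text/css">
  /* Hypothetical rules, for illustration only. */

  /* Ignored by the browser: a quoted color is not a valid CSS value.
     label { color: "navy"; } */

  /* Accepted: color keywords and hex values are written without quotes. */
  label   { color: navy; }
  .button { background-color: #cccccc; }
</style>
```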


With a working XHTML application (by "working," I mean I can choose a recipient and a message, then click on the "Post Message" button, and the message gets sent), the next step is to add voice to the application.





Coding the voice part

About this time, I was thinking "Voice is easy, so maybe my friend was right." I'm a whiz with VoiceXML, so a Voice form should be easy. And it was. I quickly added a VoiceXML form with a pair of fields to the XHTML application and wrote a small external grammar with rules for each field. I also added a couple of sync tags, which handle keeping the Voice form's fields in sync with the XHTML form's fields. The Voice form looked like Listing 2.


Listing 2. Voice form
<vxml:form id="getMessageInfo">

  <vxml:field xv:id="person" name="person">
    <vxml:prompt> Who do you want to send the message to? </vxml:prompt>
    <vxml:grammar src="sms.grxml#person" type="application/srgs+xml"/>
    <vxml:nomatch>
      <vxml:prompt> Sorry, what was that? </vxml:prompt>
    </vxml:nomatch>
  </vxml:field>
  
  <vxml:field xv:id="message" name="message">
    <vxml:prompt> What message would you like to send? </vxml:prompt>
    <vxml:grammar src="sms.grxml#message" type="application/srgs+xml"/>
    <vxml:nomatch>
      <vxml:prompt> Sorry, what was that? </vxml:prompt>
    </vxml:nomatch>
   </vxml:field>
   
</vxml:form>

The grammar rules referenced by the two fields look like the code in Listing 3.


Listing 3. Grammar rules
<rule id="person" scope="public">
  <one-of>
    <item> Jeff  <tag> $ = "Jeff" </tag> </item>
    <item> Igor  <tag> $ = "Igor" </tag> </item>
  </one-of>
</rule>

<rule id="message" scope="public">
  <one-of>
    <item> Yes </item>
    <item> No </item>
    <item> Call me </item>
    <item> I need directions </item>
    <item> Where are you? </item>
    <item> I'll call you later </item>
    <item> I'm busy </item>
  </one-of>
</rule>
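
Listing 3 shows only the rules. In the external grammar file they would sit inside an SRGS grammar element, roughly as sketched here. The file name sms.grxml comes from the grammar references in Listing 2; the xml:lang value is an assumption, and any attribute that selects the semantic interpretation format is left out because the article does not say what the original file used.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- sms.grxml: assumed layout of the external grammar file -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice">

  <!-- Referenced from the Voice form as sms.grxml#person -->
  <rule id="person" scope="public">
    <one-of>
      <item> Jeff <tag> $ = "Jeff" </tag> </item>
      <item> Igor <tag> $ = "Igor" </tag> </item>
    </one-of>
  </rule>

  <!-- The message rule from Listing 3 goes here in the same way -->

</grammar>
```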

And the sync tags, which synchronized the different input modalities, looked like Listing 4.


Listing 4. sync tags
<xv:sync xv:input="recipient" xv:field="#person"/>
<xv:sync xv:input="message" xv:field="#message"/>

The xv:input values (recipient and message) are the IDs of the XHTML form inputs, and the xv:field values (#person and #message) refer to the xv:id attributes of the Voice form fields.
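
To see how the pieces fit together, here is a rough skeleton of the whole document. It is a sketch, not the original file: the title is invented, the namespace declarations are the ones X+V documents normally use, and placing the Voice form and the sync tags in the head follows the usual X+V convention rather than anything shown in the listings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events"
      xmlns:xv="http://www.voicexml.org/2002/xhtml+voice">
  <head>
    <!-- Hypothetical title -->
    <title>Send an SMS message</title>

    <!-- Voice form from Listing 2 -->
    <vxml:form id="getMessageInfo">
      <!-- person and message fields, each with an xv:id -->
    </vxml:form>

    <!-- sync tags from Listing 4 tie each XHTML input to a Voice field -->
    <xv:sync xv:input="recipient" xv:field="#person"/>
    <xv:sync xv:input="message" xv:field="#message"/>
  </head>

  <!-- The load event starts the Voice form named by ev:handler -->
  <body id="documentBody" ev:event="load" ev:handler="#getMessageInfo">
    <form name="doIt" method="get">
      <!-- select elements from Listing 1 -->
    </form>
  </body>
</html>
```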

So far so good. When the document is loaded into the browser, the event handler I added to the body element kicks off the Voice dialog, and I can fill in all of the fields by voice (the mouse still works, of course). Spectacular. There's only one small problem: When it was just XHTML, clicking on the "Post Message" button resulted in the message being sent. But somewhere after adding the Voice dialog, that broke. Now, clicking the button does nothing.





Why isn't it working?!

It used to work (really, I tested it!). I dug through my code looking for some obvious error, but there was none. I was lost. Nothing had changed in my code relating to the button, and yet it had stopped working. As all good programmers do when faced with this sort of problem, I just started playing around with my code.

And during one iteration, I happened to add another button to the XHTML document outside the "doIt" form. Can you believe it worked? Yup, it did. So then I moved the "Post Message" button out of the form, and it worked again. Somehow the form was catching the button events and not sending them on to its children. Shame, shame, Mr. Form. Bad form (pun intended).
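
The workaround amounts to something like the following fragment. The form and button markup are taken from Listing 1; the p wrapper around the button is my own assumption, added only so the input still sits inside a block element once it leaves the form.

```xml
<body id="documentBody" ev:event="load" ev:handler="#getMessageInfo">
  <form name="doIt" method="get">
    <!-- recipient and message select elements, unchanged from Listing 1 -->
  </form>

  <!-- Moved out of the form so that the click reaches SendMessage() -->
  <p>
    <input type="button" value="Post Message" class="button"
           onclick="SendMessage()"/>
  </p>
</body>
```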

But with that solved, or worked around anyway, I now had a pretty cool application. And so far, I'd built it entirely with a text editor (I use vi, sorry emacs people) and the Opera browser. The XHTML form was pretty standard stuff; there wasn't much room for improvement there. But on the voice side of things, I was using something that's known as "directed dialog" in the voice industry; essentially, the browser is prompting the user for each individual piece of information. For simple applications, directed dialog is just fine (it's also useful in some instances in complex applications, but that's another discussion altogether). I wanted my application to stand out, so the obvious thing to change is the dialog style.





Empowerment through speech

An alternative to directed dialog is called mixed initiative. In a mixed initiative dialog, instead of the computer prompting the user for each individual piece of information it needs (the recipient and the message in this case), the user can speak a single phrase that fills in multiple fields. As an example, consider the following:

Computer: Who do you want to send the message to?
Me: Igor
Computer: What message would you like to send?
Me: I need directions

It's not bad, but imagine if there were more fields that needed to be filled in! A mixed initiative dialog is much more natural. Consider this dialog:

Computer: What would you like to do?
Me: Tell Igor that I need directions

In a single utterance, I've provided all of the pieces of information needed to complete the dialog. If any information were still missing, the computer would fall back to directed dialog and prompt me for the missing pieces individually.

As it turns out, writing a mixed initiative dialog is not much more work after you have a working directed dialog. Essentially, two things need to be done. First, the Voice form needs to have some sort of high-level prompt, like the "What would you like to do?" in the dialog above. This is done through VoiceXML's initial element. Basically, it's a container for the prompts and some dialog logic. An example of an initial element is shown in Listing 5.


Listing 5. Example of initial element
<vxml:initial name="start">
  <vxml:prompt> What would you like to do? </vxml:prompt>
  <vxml:noinput count="3">
    <vxml:prompt> Let's try directed dialog </vxml:prompt>
    <vxml:assign name="start" expr="true"/>
  </vxml:noinput>
  <vxml:nomatch count="3">
    <vxml:prompt> Let's try directed dialog </vxml:prompt>
    <vxml:assign name="start" expr="true"/>
  </vxml:nomatch>
</vxml:initial>

It's not particularly elaborate. It gives the user a high-level prompt, and if there are three noinput events (the user didn't say anything), or three nomatch events (the user said something that didn't match any grammars), then the initial's "guard variable" ('start') is set to true and you revert to directed dialog.

The second change that has to take place involves the grammar, and is a bit more involved. Right now, each field in the Voice form references a particular rule in the grammar that accepts input specific to that field. If you want the user to be able to say more elaborate phrases, not only do you need a grammar that accepts these elaborate phrases, but you also need a mechanism to identify the important pieces of the grammar, and where they fit into the Voice dialog.





It's all semantics

Enter semantic interpretation. Semantic interpretation is a language that's used within a grammar to indicate which parts of a phrase are important and which are "fluff." There are a number of styles of semantic interpretation in the industry. The Opera browser uses the April 1, 2003 draft of the W3C Semantic Interpretation for Speech Recognition Specification 1.0 (if you think the name is long, wait until you see the specification!). The language designers did a pretty good job with a fairly complicated task. Semantic interpretation isn't the easiest concept to pick up, but after you do, it's simple.

If you go back and look at the grammar rules earlier in the document, you'll see a couple of examples of semantic interpretation -- specifically, the information contained within the tag elements (for example, <tag> $ = "Jeff" </tag>). That's a simple example. A more elaborate (and suitable for our needs) example is shown in Listing 6.


Listing 6. A more elaborate example of semantic interpretation
<rule id="midialog" scope="public">
  <item>
    <item> Tell </item>
    <ruleref uri="#person"/>
    <item repeat="0-1">
      <one-of>
        <item> that </item>
        <item> to </item>
      </one-of>
    </item>
    <ruleref uri="#message"/>
    <tag> $.person = $person; $.message = $message; </tag>
  </item>
</rule>

The interesting parts of this rule are the two ruleref elements and the tag element. The rule references point to the same rules that I used earlier in this document (code reuse in action). The tag element holds the semantic interpretation. Its contents are two ECMAScript assignments. The first assigns the result of the "person" rule (which is stored in the variable $person) to the property named "person" in the ECMAScript object named $ ($.person). The second stores the result of the message rule in $'s message property ($.message).

These property names are not arbitrary. They match the names of the fields in the Voice form (for example, <vxml:field xv:id="person" name="person">). This is how the browser knows which parts of the user's utterance should be used to fill which parts of the Voice form.

You also need to reference the newly written grammar rule. The grammar rules you've referenced within the individual fields are known as field-level grammars; they are only active when the individual fields that contain them are active (there are exceptions, of course). This newly created rule isn't tied to a particular field, but, rather, is active whenever the form is active. It's called a form-level grammar, and is referenced with a grammar tag just like a field-level grammar is, but its parent is the form rather than an individual field.
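
Put together, the Voice form would look roughly like this sketch. The midialog rule refers to the person and message rules with local references, so it is assumed to live in the same sms.grxml file; the order of the form-level grammar and the initial element inside the form is my choice, not something the article specifies.

```xml
<vxml:form id="getMessageInfo">

  <!-- Form-level grammar: its parent is the form, so it is active
       whenever the form is active -->
  <vxml:grammar src="sms.grxml#midialog" type="application/srgs+xml"/>

  <!-- High-level prompt from Listing 5 -->
  <vxml:initial name="start">
    <vxml:prompt> What would you like to do? </vxml:prompt>
    <!-- noinput and nomatch handlers as in Listing 5 -->
  </vxml:initial>

  <!-- Fields keep their field-level grammars for the directed-dialog fallback -->
  <vxml:field xv:id="person" name="person">
    <!-- prompt, grammar sms.grxml#person, nomatch handler as in Listing 2 -->
  </vxml:field>

  <vxml:field xv:id="message" name="message">
    <!-- prompt, grammar sms.grxml#message, nomatch handler as in Listing 2 -->
  </vxml:field>

</vxml:form>
```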





Checking your work

With the form-level grammar and the initial element added to the Voice form, I started the browser to give the application a whirl. Unfortunately, it didn't work. In addition to those magic words ("What would you like to do?"), I also heard "an error has occurred in this application." I looked through the error log and saw that the problem was with the grammar, so I brought up an XML editor, which told me I hadn't closed a <one-of> (don't worry, it's closed in the code above). I'm sure I could have looked through the grammar one line at a time and would have eventually noticed the problem, but it was sure handy to have an intelligent editor to do the looking for me.

After the error was fixed, the application worked great. I can turn the microphone on and spout out a complete phrase, have it recognized, and have all of the XHTML form fields populated automatically. But there's still one thing missing (I freely admit that I'm an overachiever).

After I've spoken the phrase and filled in all of the fields, why would I need to manually click the "Post Message" button? The answer is, of course, I don't. As it turns out, when the Voice form is filled in, a vxmldone event is passed back to the XHTML document to handle or ignore as it sees fit. By adding a listener for that event, I can programmatically "click" on the "Post Message" button. The handler looks like Listing 7.


Listing 7. Handler for Post Message button
<script id="doneHandler" declare="declare"
        ev:event="vxmldone" ev:observer="documentBody">
  SendMessage ( );
</script>
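
None of the listings shows SendMessage itself. The following is only a guess at its general shape, not the code from the sample download: the URL is a placeholder rather than a real SMS gateway, and the empty-field check is my own addition. It reads the two select elements from the "doIt" form in Listing 1 and submits the form.

```xml
<script type="text/javascript">
  // Hypothetical sketch; the article does not show SendMessage.
  function SendMessage() {
    var form = document.forms["doIt"];
    var recipient = form.elements["recipient"].value;
    var message = form.elements["message"].value;

    // Do nothing until both fields have a value.
    if (recipient === "" || message === "") {
      return;
    }

    // Placeholder URL, not a real provider's address.
    form.action = "http://example.com/send-sms";
    form.submit();
  }
</script>
```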

I also added a filled handler to the Voice dialog to provide a bit of feedback when both fields are filled in. That code is even simpler, as shown in Listing 8.


Listing 8. Filled handler to provide feedback
<vxml:filled mode="all">
  <vxml:prompt> Ok got them. </vxml:prompt>
</vxml:filled>





In conclusion

And that's all. I might have gone a little overboard with the application, but it's not "pretty cool" anymore; it's "very cool," though I've been getting instant messages from friends telling me to "knock it off."

As to my friend's claim that he can make a really cool application using X+V with just a text editor and the Opera browser, well, I suppose if you never make a typo, he's absolutely right. But for those of us who make typos once in a while, a verbose log and an XML-aware editor can't be beat.






Download

Description: code sample
Name: wi-messageCode001.zip
Size: 4 KB
Download method: HTTP





About the author

Jeff Kusnitz has worked at IBM for 18 years, focusing almost exclusively on telephony and speech recognition platforms. In his current position with the IBM Software Group, Jeff spends most of his free time snorkeling on the island of Maui and working with standards organizations, including the W3C and the VoiceXML Forum, and working on the IBM Reusable Dialog Component framework and Natural Language Understanding technology.



