Independent Study VoiceXML

By Chetan Sharma

Under Supervision of Professor C. Tappert, Pace University


VoiceXML is a Web-based markup language for representing human-computer dialogs, just like HTML. While HTML assumes a graphical web browser with display, keyboard, and mouse, VoiceXML assumes a voice browser with audio output (computer-synthesized and/or recorded) and audio input (voice and/or keypad tones). VoiceXML leverages the Internet for voice application development and delivery, greatly simplifying these difficult tasks and creating new opportunities.

VoiceXML 1.0 is also a specification of the VoiceXML Forum, an industry consortium of over 300 companies.  The Forum is active in the conformance testing, education, and marketing of VoiceXML, and has given control over further language development to the World Wide Web Consortium (W3C).  Because it is a specification, applications that work on one conformant VoiceXML platform will work on others as well.

VoiceXML is a programming language for describing call flows for interactive voice applications, and it provides a clean and simple means of doing so.

VoiceXML documents can perform programming functions such as arithmetic and text manipulation. This allows a document to check the validity of the user's input. Also, a user's session need not be a simple sequence that runs the same way every time. The document may include "if-then-else" decision making (branching) and other complex structures. Writing powerful documents is easier when you use Nuance Speech Objects. These are pieces of software that are pre-written, tested, and packaged in a form that is easy for a VoiceXML document to use. Speech Objects conduct dialogs for common functions such as accepting credit card numbers, times and dates, and dollar amounts.

Grammars are the most important aspect of developing a VXML application. Motorola, IBM, TellMe Studio, and BeVocal are among the known firms working on VXML application development. A standard version of VXML has yet to emerge, since each company currently offers its own platform-dependent solution.

VoiceXML is a derivative of the Extensible Markup Language (XML). XML is the standard format for defining structured documents and data on the Web. XML enables programmers to define an arbitrary vocabulary, formally known as a schema, using a standard, well-defined, easily parsed syntax. One XML schema might describe customer information, another might describe a mathematical equation, and yet another might describe a recipe for chocolate chip cookies. The initial project resulted from collaboration between IBM, Motorola, Lucent and AT&T. The current list of members covers a broad spectrum of the computer industry. We will explore how VoiceXML goes beyond the graphical user interfaces of HTML and provides a framework for the most natural form of communication: spoken language.

The world of VoiceXML is changing weekly. One of the first companies to offer a system for experimentation was IBM, through its alphaWorks program. IBM integrated an early version of VoiceXML with its ViaVoice speech technology. Most of the software can be downloaded free. The system supports Microsoft Windows and desktop recognition. Several companies have already deployed VoiceXML systems for their internal development work, but do not generally make their platforms available to developers. Nuance and SpeechWorks, both providers of telephone-based recognizers, have VoiceXML initiatives underway with extensive developer programs. Nuance has announced a dial-up phone system for testing scripts. It was scheduled for release in late August as of this writing and may be used free for 60 days. Tellme appears to be the furthest along with its Tellme Labs developer program. It offers a free dial-up number and a range of tools for developing and testing VoiceXML scripts. It even includes a window that may be used while browsing the developer site to write and modify VoiceXML scripts, which can then be tested immediately from the free dial-up number.

A system can also be built by assembling the necessary components outlined in Figure 1. This approach requires much more effort and a deeper understanding of how the parts fit together. Telephony cards, DSPs and various servers must all be made to work together. The most widely used recognizers, from AT&T, IBM, L&H/Dragon, Nuance and SpeechWorks, were not initially designed to work with VoiceXML. Several of the dynamic aspects of VoiceXML make it difficult to simply match the respective speech APIs to the VoiceXML requirements. The VoiceXML committee is still working on a standard capable of supporting the more complex grammars that each of the recognizers referenced above is built on. The result is several different platform-dependent VoiceXML solutions that are likely to change over time.

A key value of VoiceXML, much like HTML, is its simplicity. It isn't a full-fledged programming language, and it is to be expected that, as happened with HTML, a new set of professionals will specialize in VoiceXML. Writing good VoiceXML dialogs requires a sense of what makes human dialog work. Some of the more proficient writers of VoiceXML dialogs that I've worked with have come from a background in linguistics, audio engineering or the broadcast industry. While XML can be a little alien for someone who has never programmed, the commands are easily learned. The art of good dialog design is an open challenge, a call to those with an understanding of verbal discourse.


Components of a VoiceXML System

Any web site can be a VoiceXML content server. No special hardware or software is necessary. Servers respond to requests by generating either canned or dynamically generated VoiceXML scripts, which are passed by HTTP back to the gateway. VoiceXML scripts look very much like HTML documents.

For example, a <prompt> tag indicates that the gateway system should play a prompt, either synthesized speech or recorded audio, to the customer. A <field> tag is used to indicate an input field. The presence of the <field> tag is a cue to the speech recognition engine to listen for user input and interpret it according to a grammar specified in the script.
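As an illustration of a dynamically generated script, the following Python sketch shows how a content server might build a small VoiceXML document containing one prompt and one field. The function name, the form id, and the station names are hypothetical, and the inline grammar text is only a placeholder, since the grammar format depends on the recognizer platform.

```python
import xml.etree.ElementTree as ET


def build_field_document(field_name, question, choices):
    """Return a small VoiceXML document (as a string) that asks one question.

    The grammar text below is an assumed, platform-dependent format:
    a simple list of alternatives separated by "|".
    """
    vxml = ET.Element("vxml", version="1.0")
    form = ET.SubElement(vxml, "form", id="ask_" + field_name)  # hypothetical id

    # The <field> element tells the gateway to listen for user input.
    field = ET.SubElement(form, "field", name=field_name)

    # The <prompt> element holds the text the gateway plays to the caller.
    prompt = ET.SubElement(field, "prompt")
    prompt.text = question

    # The <grammar> element lists what the recognizer should accept.
    grammar = ET.SubElement(field, "grammar")
    grammar.text = " | ".join(choices)

    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    # Station names are examples only.
    print(build_field_document("origin",
                               "Where are you departing from?",
                               ["white plains", "scarsdale"]))
```

Building the document with an XML library rather than string concatenation keeps the output well-formed even when the prompt text contains characters such as "&" or "<".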

Like conventional web pages, VoiceXML scripts may have embedded server-side or client (gateway-side) script. A specialized tag called <object> allows the incorporation of platform-specific functionality. Many VoiceXML scripts will probably contain a combination of "pure" VoiceXML and pre-written modular components written in Java or ActiveX.

Interpretation of the script and the interaction with the user is controlled by the VoiceXML gateway. Gateways are special collections of hardware and software which form the core of VoiceXML technology. Essentially they provide the presentation services component of VoiceXML, analogous to the web browser in conventional HTTP service.

Goals of VoiceXML

VoiceXML's main goal is to bring the full power of web development and content delivery to voice response applications, and to free the authors of such applications from low-level programming and resource management. It enables integration of voice services with data services using the familiar client-server paradigm. A voice service is viewed as a sequence of interaction dialogs between a user and an implementation platform. The dialogs are provided by document servers, which may be external to the implementation platform. Document servers maintain overall service logic, perform database and legacy system operations, and produce dialogs.


A VoiceXML document specifies each interaction dialog to be conducted by a VoiceXML interpreter. User input affects dialog interpretation and is collected into requests submitted to a document server. The document server may reply with another VoiceXML document to continue the user's session with other dialogs.

Advantages of VoiceXML


              Minimizes client/server interactions by specifying multiple interactions per document.

              Shields application authors from low-level and platform-specific details.

              Separates user interaction code (in VoiceXML) from service logic (CGI scripts).

              Promotes service portability across implementation platforms. VoiceXML is a common language for content providers, tool providers, and platform providers.

              Safely handles shared network-based applications. No arbitrary computations are allowed, and platform resources are protected.

              Is easy to use for simple interactions, and yet extensible for complex ones.



The language describes the human-machine interaction provided by voice response systems, which includes:


              Output of synthesized speech (text-to-speech)

              Output of audio files

              Recognition of spoken input

              Recognition of DTMF input

              Recording of spoken input

              Telephony features such as call transfer and disconnect

The language provides means for collecting character and/or spoken input, assigning the input to document-defined request variables, and making decisions that affect the interpretation of documents written in the language. A document may be linked to other documents through Uniform Resource Identifiers (URIs). When a link is followed, request variables and their values, if present, are submitted to the link's URI.
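The submission of request variables can be pictured with a short Python sketch. It assumes the variables are sent as a GET-style query string, which is only one possible encoding and is not stated above; the function name and the example.com address are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_submit_url(link_uri, request_variables):
    """Append the request variables that have a value to the link's URI.

    Assumption: variables travel as a query string (HTTP GET style).
    Variables whose value is None are treated as "not present" and skipped.
    """
    present = {name: value
               for name, value in request_variables.items()
               if value is not None}
    if not present:
        return link_uri
    return link_uri + "?" + urlencode(present)


if __name__ == "__main__":
    # Hypothetical document server address and variable names.
    url = build_submit_url("http://example.com/schedule",
                           {"origin": "White Plains", "dest": None})
    print(url)
    # The document server can recover the variables from the query string.
    print(parse_qs(urlsplit(url).query))
```

On the server side, the recovered variables would then drive the service logic that decides which VoiceXML document to return next.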

Details on the Application I Worked On:



VXML Use Case - Using and IBM's Voice Server SDK


Commuter: Needs to know the next time a Metro-North train departs from a station on the Harlem line.


Dialog: (Note: VS = Voice Server)


1: VS: Welcome to the Metro-North Harlem line VOICE train scheduler.

2: VS: This service will help you determine departure and travel time

3: VS: from your point of departure.

4: VS: Let's start.

5: VS: Where are you departing from?


COMMUTER: <speaks the train station name> $origin


6: VS: What is your destination?


COMMUTER: <speaks the train station name> $dest


7: VS: How long will it take you to arrive at $origin?


COMMUTER: <speaks time in minutes>


8: VS:

1.      Determines when the commuter will arrive at the origin station, adding a 5-minute grace period: $departTime + $gracePeriod.

2.      Determines the next train to arrive at the origin station at or after this time: $arrive.

3.      Calculates the total travel time on the train to the destination station: $totalTime.

4.      Responds: "The next train will depart from $origin at $arrive, with a grace period of $gracePeriod. The total travel time will be $totalTime."


9: VS: Would you like to submit another train itinerary?


COMMUTER: <speaks YES or NO>


10: VS:


1.      Determines whether the answer is YES or NO.


a.       If YES, go to step 5.

b.      If NO, respond: "Thank you for using the Metro-North Harlem line VOICE train scheduler."
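
The calculation in step 8 of the dialog could be sketched in Python as follows. This is a minimal sketch, not the code used in the application: the function name, the timetable values, and the ride duration are made up for illustration, and it assumes the timetable is available as a simple list of departure times from the origin station.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=5)  # the 5-minute grace period from step 8


def plan_trip(now, minutes_to_origin, departures, ride_minutes):
    """Return (departure_time, total_travel_time), or None if no train is left.

    now:               current time
    minutes_to_origin: the commuter's answer to question 7
    departures:        departure times from the origin station (assumed data)
    ride_minutes:      minutes on the train from origin to destination
    """
    # Step 8.1: arrival at the origin station plus the grace period.
    ready_at = now + timedelta(minutes=minutes_to_origin) + GRACE_PERIOD

    # Step 8.2: first train leaving at or after that time.
    for departure in sorted(departures):
        if departure >= ready_at:
            # Step 8.3: total travel time on the train.
            return departure, timedelta(minutes=ride_minutes)
    return None


if __name__ == "__main__":
    # Made-up timetable for illustration only; not real Metro-North data.
    now = datetime(2001, 1, 1, 8, 0)
    departures = [datetime(2001, 1, 1, 8, 10), datetime(2001, 1, 1, 8, 30)]
    result = plan_trip(now, 10, departures, 40)
    if result is None:
        print("There are no more trains today.")
    else:
        departure, total = result
        # Step 8.4: the response spoken back to the commuter.
        print(f"The next train will depart at {departure:%H:%M}. "
              f"The total travel time will be {total.seconds // 60} minutes.")
```

In a deployed service this logic would sit on the document server, which would place the computed values into the VoiceXML response for the gateway to speak.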