Computer Networks and ISDN Systems 29 (1997) 1377-1383
Cascaded speech style sheets

T.V. Raman
Advanced Technology Group, Adobe Systems, P.O. Box 7900, Mountain View, CA, USA
Abstract

Cascading Style Sheets (CSS) enable WWW designers to separate layout from content on a WWW site and help the site designer customize the look and feel of a site without having to edit all the pages making up the site. Style sheets are often thought of as a means to specify the visual appearance of a WWW page. This paper takes a more general view: CSS style sheets can in fact be used equally well to control the appearance of a WWW site when presented in non-traditional modalities such as speech. This paper outlines the reasoning behind the design of the speech style sheet specification and describes a working implementation that produces high-quality audio-formatted spoken renderings of well-authored WWW content. The paper reinforces the need to keep WWW site design independent of specific browser implementations of today by demonstrating the ability to specify aural renderings that can in principle be completely separate from the visual appearance of a WWW page, given a well-structured collection of HTML documents. © 1997 Published by Elsevier Science B.V.
Keywords: Speech; Audio; Rendering styles
1. Design goals

We begin this paper with an outline of the design goals behind the CSS speech specification. The work is motivated by earlier work on audio formatting introduced in AsTeR - Audio System For Technical Readings. Audio formatting is a process analogous to the familiar notion of visual formatting. Visual formatting projects information structure onto a visual display in a manner that makes it easy for the eye to perceive and navigate the information structure envisioned by the author. Visual layout exploits features of the two-dimensional display and the eye's ability to randomly access portions of this display in presenting information efficiently. Audio formatting achieves an analogous effect in the domain of aural renderings. Unlike visual renderings, where the reader actively browses a relatively passive display, audio renderings are characterized by an actively scrolling display that flows linearly past a passive listener. Audio formatting in conjunction with structure-based audio browsing compensates for these deficiencies and produces rich interactive audio documents that are as efficient to use as their visual counterparts.

Disclaimer: The opinions expressed are my own and in no way should be taken as representative of my employer, Adobe Systems Inc.
E-mail: [email protected]

0169-7552/97/$17.00 © 1997 Published by Elsevier Science B.V. All rights reserved.
PII S0169-7552(97)00065-2
2. The CSS speech specification: style sheets for producing spoken renderings, v1.5

2.1. Purpose of speech CSS
This document defines style-sheet extensions that add property-value definitions specific to aural renderings. This initial specification attempts to define properties that will be general, while at the same time allowing browser implementors maximal flexibility in exploiting the features provided by different auditory displays. As the functionality provided by such displays becomes standardized, this specification will evolve to encompass the features they provide. Note that speech style sheets play a dual role: they specify how a document should be rendered aurally to a user who is functionally blind, i.e. is not currently looking at a visual display, and they may also specify how a visual rendering should be augmented with sound cues to provide a truly multimodal rendering. Examples of situations where a user is functionally blind in terms of looking at the computer display include:
• A user wishing to read a document while driving.
• Users engaged in other eyes-busy tasks.
• A visually impaired user.

2.2. Design philosophy
A simple-minded approach would dictate that an aural browser use the information present in the standard stylesheet to convey the same information aurally. This would only fit the scenario of producing a faithful aural presentation of a WWW document to someone who cannot see the visual display. Such an aural rendering is not desirable in general, because decisions made by a visual rendering system, such as line and page breaks, are irrelevant to the listener. Aurally rendering the visual layout is not adequate to convey structural information aurally. We adopt the more sophisticated solution of defining a separate (possibly cascaded) speech style-sheet so as to:
• Realize that the aural rendering is essentially independent of the visual rendering.
• Allow orthogonal aural and visual views.
• Allow complementary aural and visual views.
• Allow future browsers to optionally implement both aural and visual views to produce truly multimodal documents.

This said, an auditory browser is free to use the information provided by the standard visual stylesheet to augment the aural rendering where necessary. Thus, when rendering a well-written document that uses the emphasis tag to mark emphasized phrases, such an aural browser would use the speech properties specified for emphasis in the speech stylesheet. However, if a document uses layout-specific tags, an aural browser can fall back on a default rendering that maps specific speech properties to the visual layout tags. In general, the speech stylesheet will not attempt to specify the mapping between visual layout tags and speech properties, instead leaving it to specific browser implementations and user-specific stylesheets to decide how such tags are rendered.
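To make this division of labor concrete, a speech stylesheet can style structural markup such as emphasis directly, while remaining silent about purely visual layout tags. The rules below are a hypothetical sketch using the aural properties defined later in this paper; the particular values are illustrative and are not drawn from the specification itself.

```css
/* Hypothetical speech rules for structural markup only.
   Purely visual layout tags are deliberately left to the
   browser's default aural mapping. */
em     { pitch: 6; pitch-range: 6; stress: 6; }
strong { pitch: 6; pitch-range: 6; stress: 9; }
```

A browser applying these rules would speak emphasized phrases with noticeably more inflection, without the stylesheet ever mentioning how emphasis happens to look on screen.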
3. Aural properties

In the following, we enumerate each property along with its possible values. Explanatory paragraphs describe how a browser might use such properties and their possible effect. The syntax used in the speech style sheet is the same as defined in CSS1; hence, this document will not explicitly define the syntax. For all purposes, this document should be considered as an appendix to (or part of) the CSS1 specification. We enumerate a collection of aural properties that allows designers to exploit the capabilities of a wide range of auditory displays. Implementors using simpler audio output devices are free to map properties specified by a style sheet to audio properties that are available on a particular device. We provide this flexibility to allow a rich collection of aural renderings. The field of audio formatting is relatively new (see AsTeR - Audio System For Technical Readings - for research defining some of the key notions in this area; also see Janet Cahn's Masters Thesis entitled Generating Expression in Synthesized Speech (Copyright MIT 1990) for additional examples of varying speech synthesis parameters to produce interesting effects). Restricting the style sheet specification language to the
constraints of lower-quality devices would throttle research in this field.

3.1. Speech properties
Speech properties specify the voice characteristics to be used when rendering specific document elements.

:volume level [0-10] or number (NNNdb, specified in decibels) or [soft | medium | loud]
The volume of the speaker, specified as a numeric level, in decibels, or using the keywords soft, medium or loud. If specified as a level, the volume is mapped by the implementation of the UA to an appropriate device setting, with a setting of 5 interpreted as "medium".

:left-volume number 1-100 (percentage)
Specifies the speaker volume for the left channel. Devices not supporting stereo output may ignore this setting.

:right-volume number 1-100 (percentage)
Specifies the speaker volume for the right channel. Devices not supporting stereo output may ignore this setting.

:voice-family string
Analogous to the :font-family property. This specifies the kind of voice to be used, and can be something generic such as male, something more specific such as comedian, or something very specific such as Paul. We recommend the same approach as used in the case of :font-family: the style sheet provides a list of possible values ranging from most to least specific, and the browser picks the most specific voice that it can find on the output device in use.

:speech-rate level [1-10] or number (NNNwpm, words per minute) or [slow | medium | fast]
Specifies the speaking rate. If specified as a level, 5 is interpreted as medium.
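A short rule combining the properties above might look as follows. This is a hypothetical sketch; the blockquote selector and the chosen values are illustrative and do not come from the paper's own sample stylesheet.

```css
/* Hypothetical: speak quoted material softly and slowly
   in a generic male voice, leaving other text unaffected. */
blockquote {
  volume: soft;
  speech-rate: slow;
  voice-family: male;
}
```

As with :font-family, a device lacking a generic male voice would simply fall back to its default voice while still honoring the volume and rate settings it can support.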
:average-pitch level [1-10] or number (NNNhz, hertz)
Specifies the average pitch of the speaking voice in hertz (Hz). The average pitch is the fundamental frequency of the speaking voice. Lowering it typically produces a deeper voice; increasing it produces a higher-pitched voice. Listen to AsTeR rendering superscripts and subscripts for an example of this effect.

:pitch-range number (percentage variation 0-200)
Specifies variation in average pitch. A pitch range of 0 produces a flat, monotonic voice. A pitch range of 100 produces normal inflection. Pitch ranges greater than 100 produce animated voices. Less sophisticated speech output devices provide a simple prosody setting that acts as a toggle, setting this value to either 0 or 100.

:stress number (0-100)
Specifies the level of stress (assertiveness or emphasis) of the speaking voice. English is a stressed language, and different parts of a sentence are assigned primary, secondary or tertiary stress. The value of property :stress controls the amount of inflection that results from these stress markers. Different speech devices may require the setting of one or more device-specific parameters to achieve this effect. Increasing the value of this property results in the speech being more strongly inflected. It is in a sense dual to property :pitch-range, and is provided to allow developers to exploit higher-end auditory displays. The resulting voice sounds excited or animated.

:richness number (0-100)
Specifies the richness (brightness) of the speaking voice. Different speech devices may require the setting of one or more device-specific parameters to achieve this effect. The effect of increasing richness is to produce a voice that carries; reducing richness produces a soft, mellifluous voice. For an example
of continuously reducing richness, listen to AsTeR rendering a continued fraction. Note: in that example the voice also grows more animated; this is a result of increasing the value of property :stress.

:speech-other list of name-value pairs
Allows implementors to experiment with features available on specific speech devices. The use of this property is device-specific, but it is provided as an escape mechanism since auditory displays are not yet as standardized as their visual counterparts. Implementors are encouraged to use this property only where absolutely necessary. In many cases, the desired effect can be abstracted using the properties defined earlier, with the device-specific component of the browser mapping a single abstract property to a collection of device-specific properties. In general, we expect document-specific style sheets to completely avoid this escape mechanism to ensure that documents remain device-independent. User-specific and UA-specific local stylesheets may choose to use this facility to enhance the presentation.

3.2. Miscellaneous speech settings

In addition to specifying voice properties, a speech style sheet also specifies auxiliary information such as the amount of pause to insert before or after rendering document elements. Pause can be used to great effect in conveying structural information. Experience with AsTeR (Audio System For Technical Readings) has shown that small amounts of pause - 5 to 20 milliseconds - can prove perceptually significant and aid in the grouping of mathematical subexpressions. Listen to AsTeR rendering simple fractions, where pauses are used effectively to convey grouping.

:pause-before number (milliseconds)
Amount of pause (analogous to white space). Specifies the number of milliseconds of silence
to insert before rendering a document element. In situations where the :pause-before intersects the :pause-after of the preceding document element, we compute the amount of pause to insert in a manner similar to that used to compute the amount of intervening whitespace in producing visual renderings.

:pause-after number (milliseconds)
Amount of pause (analogous to white space). Specifies the number of milliseconds of silence to insert after rendering a document element.

:pause-around number (milliseconds)
Amount of pause (analogous to white space). Specifies the number of milliseconds of silence to insert before and after rendering a document element. Though this effect can be achieved by using :pause-before and :pause-after in conjunction, style-sheet designers are encouraged to use :pause-around where appropriate, since it makes the intent clearer. Perhaps :before, :after and :around should be modifiers so they can be generally applied to other property settings?

:pronunciation-mode string
Specifies the pronunciation mode to be used when speaking a document element. Pronunciation modes can include:
• Speak all punctuation marks.
• Speak only some punctuation marks. In this case, the rule for handling punctuation marks is specified by providing a value for property :punctuation-marks-to-skip or :punctuation-marks-to-speak.
• Speak contents as a date.
• Speak contents as a time string.
The set of values for this property is left open so that designers can exploit all features available in a specific device. Style-sheet designers can specify a list of values for a particular option in a manner analogous to that described for :voice-family. Browsers are expected to choose the most specific setting available on the current output device. Thus, for
property :speak-time, a style sheet could specify :speak-military-time, :speak-am-pm, etc. The device-specific component of a browser is expected to map those values that it does not understand to a suitable default. Alternatively, the device-specific component of the browser may choose to transform the contents of the document element to a form that is suitable to be rendered by the specific device. To give an example, consider the value date-string. Given a content string of the form Jan 1, 1996, an aural browser could:
• Ignore property :pronunciation-mode.
• Send the content string directly to a smart speech device capable of switching to a speak-date mode.
• Apply an appropriate transform - in this example, change Jan to January - when communicating with a less sophisticated output device.

:language string
Language to use when rendering the contents of the document element, specified by using the appropriate ISO encoding for international languages.

:country string
Specified using the ISO encoding for country codes. Can be used in conjunction with :language to specify British or American English. (See property :dialect below for variations in speaking style within a country.) This property will be useful for multilingual speech devices capable of switching between languages.

:dialect string
Specifies the dialect to be used, e.g. american-mid-western-english.

3.3. Non-speech auditory cues
Non-speech sounds can be used to produce auditory icons. Such auditory icons serve to augment the aural rendering and provide succinct cues.
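As a concrete illustration of sound cues, a stylesheet could attach a short sound to top-level headings and play a quiet backdrop while rendering tables, using the sound properties specified in this section. The selectors, file names and cue volumes below are invented for illustration and are not part of the specification.

```css
/* Hypothetical auditory icons; the .au file names are invented. */
h1    { before-sound: url(chime.au); cue-volume: 50; }
table { during-sound: url(hum.au);   cue-volume: 20; }
```

The during-sound backdrop lets a listener know, without any added verbiage, that the browser is still inside the table.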
:before-sound URL
Specifies a file containing sound data. The sound is played before rendering the document element to produce an auditory icon. An optional :cue-volume can specify a volume scaling to be applied to the sound before playing it.

:after-sound URL
Specifies a file containing sound data. The sound is played after rendering the document element to produce an auditory icon. An optional :cue-volume can specify a volume scaling to be applied to the sound before playing it.

:around-sound URL
Specifies a file containing sound data. The sound is played before and after rendering the document element to produce an auditory icon. An optional :cue-volume can specify a volume scaling to be applied to the sound before playing it.

:during-sound URL
Specifies a file containing sound data. The sound is played repeatedly while rendering the document element, producing an auditory icon that provides an aural backdrop.

3.4. Advanced settings
In the future, auditory displays may want to exploit spatial audio for producing rich aural layout. Spatial audio - a digital signal processing technique that involves convolving sound data with appropriate filters to produce spatially located sounds - can be used to make sounds appear to originate from different points in the listener's auditory space, and is popularly referred to as three-dimensional sound.

:spatial-audio :azimuth number :elevation number
Azimuth and elevation are specified in degrees and together specify the point in auditory space from which the sound appears to originate.
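Under one plausible reading of the syntax above, a stylesheet might place quoted material off to the listener's right. The selector, the degree values, and the flattened one-line form of the property are all hypothetical illustrations, since the specification leaves the exact surface syntax open.

```css
/* Hypothetical: quotations appear to come from 45 degrees
   to the listener's right, slightly above the horizontal plane. */
blockquote {
  spatial-audio: azimuth 45 elevation 10;
}
```

On a device without spatial audio support, a browser would simply ignore the rule, in keeping with the graceful-degradation approach used throughout this specification.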
4. Implementation of the speech CSS - Emacspeak and W3

This section describes the first implementation of cascaded speech stylesheets. Emacspeak - a full-fledged speech interface to GNU Emacs - provides an audio-enabled desktop. The W3 browser by William Perry is used in conjunction with the spoken environment provided by Emacspeak to implement a powerful, specification-compliant solution that provides audio-formatted renderings of WWW documents.

4.1. Implementation

Emacspeak is a full-fledged speech interface to GNU Emacs. Emacspeak should be thought of as the speech analogue to the visual GUI; just as a GUI presents a pleasant visual front-end to a set of computing applications, Emacspeak enables fluent spoken feedback from a varied collection of applications. One of Emacspeak's primary strengths is its ability to produce context-specific feedback as opposed to just speaking the screen. Thus, whereas traditional screen-reading applications for the blind merely speak the screen contents out loud - leaving it to the user to interpret the context and intent of the visual display - Emacspeak presents information aurally as opposed to speaking the screen. See the author's short talk on Emacspeak for additional details on the speech-enabling approach introduced in Emacspeak.

The Emacspeak extension to the Emacs W3 browser - a powerful WWW browser implemented entirely in Emacs Lisp - extends the speech-enabling approach to the domain of WWW browsers. Emacspeak interprets the structure of HTML documents as they are displayed and speaks the contents using sophisticated audio rendering techniques. In an early implementation of this approach, we relied exclusively on browser-specific mechanisms to achieve these effects; at the time this was how all UAs (both visual and aural) achieved high-quality renderings. The current implementation of the speech extension to W3 in Emacspeak uses the CSS mechanism and is implemented as per the specification given in the previous section. An Emacspeak user can customize a local or personal stylesheet instance to produce audio renderings that are attuned to individual tastes.

Using speech styles to convey structural information about the content being presented has the advantage of making the resulting renderings succinct. In the absence of speech styles, a listener would hear added verbiage such as

    link click here to submit

An Emacspeak user instead gets the link text spoken in a slightly different voice; the change in voice characteristic implicitly cues the listener to the fact that the text click here to submit is a link.

4.2. Example speech style sheet

This section gives a sample instance of a speech style sheet. It is extracted from my currently running instance of Emacspeak and W3; it represents an example and reflects my own personal tastes. The Emacs W3 browser divides the stylesheet into device-specific sections; only the section that is specific to an auditory display is shown below.

    /* family, pitch, pitch-range, stress, richness */
    h1,h2,h3,h4,h5,h6 { voice-family: Paul; stress: 5; richness: 9; }
    h1 { pitch: 1; pitch-range: 9; }
    h2 { pitch: 2; pitch-range: 8; }
    h3 { pitch: 3; pitch-range: 7; }
    h4 { pitch: 4; pitch-range: 6; }
    h5 { pitch: 5; pitch-range: 5; }
    h6 { pitch: 6; pitch-range: 4; }
    li { pitch: 6; richness: 6; }
    dt { pitch: 6; stress: 8; richness: 6; }
    dd { pitch: 6; richness: 6; }
    pre,xmp,plaintext { pitch: 5; pitch-range: 1; stress: 1; richness: 8; }
    key,code,tt { pitch: 1; pitch-range: 1; stress: 1; richness: 1; }
    em { pitch: 6; pitch-range: 6; stress: 6; richness: 5; }
    strong { pitch: 6; pitch-range: 6; stress: 9; richness: 9; }
    dfn { pitch: 7; pitch-range: 6; stress: 6; }
    s,strike { richness: 0; }
    i { pitch: 6; pitch-range: 6; stress: 6; richness: 5; }
    b { pitch: 6; pitch-range: 6; stress: 9; richness: 9; }
    u { richness: 0; }
    a:link { voice-family: harry; }
    a:visited { voice-family: betty; }
    a:active { voice-family: betty; pitch-range: 8; pitch: 8; }
References

[1] T.V. Raman, Audio System for Technical Readings, Ph.D. thesis, Cornell University, Ithaca, NY, May 1994, http://www.cs.cornell.edu/home/raman; the audio-formatted thesis produced by AsTeR is available from Recording for the Blind (RFB order number FB 190).
[2] T.V. Raman and D. Gries, Documents mean more than just paper!, in: Proc. 2nd International Workshop on the Principles of Document Processing, 1994.
[3] B. Hayes, Speaking of mathematics, American Scientist, 84(2), March-April 1996.
[4] T.V. Raman, Emacspeak - a speech interface, in: Proc. CHI 96, April 1996.
[5] G.C. Vanderheiden, Building disability access directly into next-generation information and transaction systems, Keynote Speech, Association for Computing Machinery (ACM), Tokyo, Japan, March 1994; gopher://trace.wisc.edu//00/ftp/PUB/TEXT/TRACEPAP/ACMJAPAN.TXT, Trace R&D Center, University of Wisconsin, Madison.
[6] W. Gibbs, Envisioning speech, Scientific American, September 1996.