Theory Behind the DTBook DTD

by George Kerscher, Senior Officer Accessible Information
Recording For the Blind & Dyslexic (RFB&D)
Project Manager to the DAISY Consortium

Archived publication, first published September 2001


Many organizations around the world providing analog talking books to persons who are blind or print disabled realized that a digital format could, if designed properly, deliver far greater functionality and sound quality to the readers they serve. These organizations also predicted the end of the analog cassette as an economically viable way to deliver their services. Technologically oriented organizations had already started an etext service in one form or another and had learned about the strengths of this format over the analog cassettes they had delivered for decades. There existed a wide variety of analog formats in use and an even wider variety of specifications for the etext products. It was 1996 when these early pioneers founded the Digital Audio-based Information SYstem (DAISY) Consortium with the goal of developing the standards for the next generation of information technology for persons who are blind or print disabled.

The DAISY Consortium’s technologists analyzed the analog formats and etext specifications, and then embarked on research and development that would combine the strengths of human narration embodied in the analog format with the strengths of evolving etext implementations. The first standard to evolve (DAISY 1.0) demonstrated hierarchical heading navigation, and the ability to go directly to page numbers.

The DAISY 2.0 specification, which became a recommendation in 1998 and was last revised as DAISY 2.02 in 2001, built on the internet standards of the World Wide Web Consortium (W3C). It uses HTML, XHTML, and SMIL as the fundamental framework on which the standard is built.

In 1997 the National Library Service for the Blind and Physically Handicapped (NLS) in the USA invited the DAISY Consortium and North American organizations serving persons with print disabilities to join them in working through the National Information Standards Organization (NISO) to develop further standards for Digital Talking Books (DTB). The DAISY Consortium decided to put its expertise and experience to work on the NISO committee and develop a third-generation standard in conjunction with NLS and its partner agencies. Drawing on DAISY’s experience building synchronized text and audio presentations, the committee began developing a new XML DTD geared toward converting print publications into a file set that supports DTBs.


The NISO Digital Talking Books Committee has defined an XML element set (DTBook) to represent the content and structure of books and other publications presented in digital talking book format. This element set borrows heavily from the W3C’s HTML 4.0 Specification, and adds specific structural tags required to accurately and unambiguously represent the content. Because XML is used, the tagged text files of DTBs can be validated for conformance with the document type definition (DTBook.dtd) that defines the DTB element set. The use of HTML core tags in the DTD also has several benefits, including the ability to take existing HTML-based content and add DTB-conforming structure elements, and to provide content that is easily rendered on visual or synthetic speech-based devices.

The DTBook DTD is unique in that it provides several classes of critical information within the structure of XML. The content is provided within semantically rich elements. The semantics are known to the system as a whole and provide essential functionality required by persons who are blind and print disabled. The semantics of the DTD as a whole and the individual elements include:

  • hierarchical global navigation targets
  • sequential reading order
  • reading choices (e.g., to read or skip all footnotes)
  • book component identification
  • local reading methods tailored to type of book component (e.g., tables are read differently from paragraphs)

Inherent in the DTD are the concepts of navigation, reading order, and reading options. While the DTD provides traditional semantics about the various block and inline elements, it is the pervasive notion of overall reading functionality that sets the DTD apart from other approaches. The theory behind the DTD will be discussed in this document.


The defining features of the DTBook DTD are based on the global and local navigation requirements of the end-user. These requirements were gathered directly from users and are laid out in the Document Navigation Features List [Navigation Features].

Global Navigation

Global navigation is efficient movement by a user to a portion of a book the reader wishes to read (e.g., section 8.3 or Appendix 7), with that movement enabled by the Navigation Control File (NCX). Two categories of destinations are found in the elements of the DTD. The first type is the hierarchical arrangement of headings, e.g., nested chapters, sections, sub-sections, etc., down to six levels denoted by the level1 through level6 tags or by the unlimited recursive use of the level element. The levels are containers only and ensure the strict hierarchical arrangement of their contents and headings. The headings themselves are the targets for navigation, because the container elements have no meaning to the reader. Global navigation can make it easy for the reader to move to a heading by expanding or collapsing the hierarchical view of the document. Efficient global navigation is accomplished through the NCX described elsewhere in the standard, but it is the strict hierarchy of headings within corresponding “level” containers that provides the ability to automatically create the hierarchical view in the NCX.
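As a rough illustration of how the strict nesting of level containers lets a hierarchical view be generated automatically, the following Python sketch walks a small DTBook-like fragment and collects its headings into a nested outline. The sample markup, the book root element as used here, and the function name build_outline are invented for illustration; a real NCX carries considerably more information than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTBook-like fragment; titles and structure are invented.
SAMPLE = """
<book>
  <level1>
    <h1>Chapter 1</h1>
    <p>Opening text.</p>
    <level2>
      <h2>Section 1.1</h2>
      <p>More text.</p>
    </level2>
    <level2>
      <h2>Section 1.2</h2>
    </level2>
  </level1>
  <level1>
    <h1>Chapter 2</h1>
  </level1>
</book>
"""

# Map each numbered container name to its depth: level1 -> 1, ..., level6 -> 6.
LEVELS = {f"level{n}": n for n in range(1, 7)}


def build_outline(container):
    """Return a nested list of (heading text, children) pairs.

    Only the level containers are followed; the heading directly inside
    each container supplies the navigation target's label.
    """
    outline = []
    for child in container:
        if child.tag in LEVELS:
            heading = child.find(f"h{LEVELS[child.tag]}")
            title = "".join(heading.itertext()).strip() if heading is not None else ""
            outline.append((title, build_outline(child)))
    return outline


outline = build_outline(ET.fromstring(SAMPLE))
```

Because each container holds its own heading and its own sub-levels, a single recursive pass is enough; no guessing from heading numbers or visual styling is needed.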

The heading element names themselves (h1 through h6) were derived from the familiar HTML 4.0 DTD, but their use is more tightly controlled in DTBook. Where HTML allows a document to apply the heading tags in any order, DTBook requires that an h1 be used only within level1, h2 only immediately within level2, etc. The containers that wrap the headings are unique to this DTD and are not found in HTML. Use of the recursive level container that does not have a numeric identifier is perhaps more elegant, but requires a view of the document’s tree structure during authoring or editing. Either approach can be used, but using both within a single document is discouraged.
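The pairing rule for numbered headings and containers can be checked mechanically. The sketch below is a simplified stand-in for what validation against the DTD would report, not a replacement for it: it flags any numbered heading whose immediate parent is not the container with the same number. The function name misplaced_headings and both sample fragments are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def misplaced_headings(root):
    """Return (heading tag, parent tag) for each hN not directly inside levelN."""
    problems = []
    for parent in root.iter():          # document order, root included
        for child in parent:            # direct children only
            match = re.fullmatch(r"h([1-6])", child.tag)
            if match and parent.tag != f"level{match.group(1)}":
                problems.append((child.tag, parent.tag))
    return problems


# Invented fragments: GOOD follows the rule, BAD swaps the two headings.
GOOD = "<book><level1><h1>A</h1><level2><h2>B</h2></level2></level1></book>"
BAD = "<book><level1><h2>A</h2><level2><h1>B</h1></level2></level1></book>"

good_report = misplaced_headings(ET.fromstring(GOOD))
bad_report = misplaced_headings(ET.fromstring(BAD))
```

The same fragment would be acceptable in HTML, where heading order is unconstrained; in DTBook the swapped headings in BAD are reported.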

The second type of global navigation destination provided in the DTD consists of text elements that a reader may want to skip on a first reading, such as footnotes. This is easily done by a sighted person reading a printed book; the reader simply chooses not to move to the bottom of the page to read the footnote. However, the reader might choose to revisit the book later and read the footnotes of interest. This functionality can be duplicated in a DTB. Anything that the user can turn off (or merely be notified of) in an automatic rendering can later be directly navigated by the reader. This principle therefore requires that certain types of elements be identified for automatic extraction into the NCX.

We can identify certain types of text elements where this principle can be applied. There are page numbers, which most people do not read, but certainly use for navigation. Footnotes, endnotes at the end of a chapter or rear notes found at the end of a book all share the same function and are only differentiated by their physical location in the printed book. Annotations share similar functionality, but differ in that their position may be much closer (in the margin) to the item referencing the annotation.

Other items found in books are the interesting, but not essential, asides that occur throughout many types of books. Normally presented as sidebars, these are items that the reader may choose to read on the first pass through a book or at a later time. While discriminating between sidebars and the main text is easy in a print book, it can be more difficult in an audio version, where these asides can be distracting to the reader. Print-disabled readers have therefore asked for the capability to disable the automatic presentation of sidebars.

In the version of books produced specifically for persons who are blind and print-disabled, producers add descriptions of visual elements such as photographs, illustrations, graphs, and charts. While this is extremely important information, it can be very time-consuming to read. Readers have requested the ability to selectively turn off the automatic rendering of these “producer notes” and return to them later to hear the detailed descriptions. All of the above elements are defined in DTBook.

Specifically, the elements defined in the DTD for the functionality described above are:

pagenum
This element identifies page numbers.
sidebar
The sidebars or asides in books are normally optional reading and can be identified as such so the reader can disable the automatic rendering of these items. It is important to note that another element, notice, identifies text elements that are also normally found in margins but are essential to the reader’s safety or understanding. Warnings, cautions, and other essential information found in boxes are tagged as notice, rather than sidebar, so that players can discriminate between the two types and prevent readers from disabling critical information. Sidebars should be referenced in the text so that the reader is aware of their presence if they are disabled.
prodnote
Producer notes contain information added by the producing organization to make its version of a printed book accessible to persons who are blind or print-disabled. The prodnote element has an attribute that indicates whether the note provides essential or supplemental information. The nonessential information can be turned off by the reader and accessed later. The reader should be notified of disabled prodnotes.
noteref and note
Footnotes, endnotes, and rear notes fall into the category of paired items. The noteref is the reference in the text to further information found elsewhere. The note is the target content of the reference. For global navigation purposes, the noteref is included in the NCX and is presented to the reader in context (i.e., with adjacent text; implementation is player-dependent) to aid the reader in deciding whether or not to read the note itself.
annoref and annotation
Annotations of text are identified by the annotation element. The annoref marks the content that the annotation refers to; the annotation itself is normally found near that referenced text. This is another example of paired elements. Their rendering is independent of notes. The annoref may be explicit, with a superscripted number, or it may be implied by highlighting. For example, an annotated poem may have words in bold that are described by the annotation in the adjacent margin. Regardless of whether the item is implied or explicit, the annoref is included in the NCX for global navigation purposes. Like the noteref, the presentation in context is implementation-dependent.
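To make the reading choices concrete, here is a minimal Python sketch of how a player might split content into what is read now and what is deferred for later navigation. It assumes the prodnote attribute is named render with the values "optional" and "required"; the text above does not name the attribute, so treat that, the sample markup, and the function name reading_plan as assumptions. Notices are never deferred, whatever the reader's settings.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the render attribute name and values are assumed.
SAMPLE = """
<level1>
  <h1>Lab Safety</h1>
  <p>Always prepare before you begin.</p>
  <sidebar><p>A short history of the Bunsen burner.</p></sidebar>
  <notice><p>Warning: wear eye protection.</p></notice>
  <prodnote render="optional">Photo: a student lighting a burner.</prodnote>
  <prodnote render="required">The diagram on this page is described next.</prodnote>
</level1>
"""


def reading_plan(container, skip_sidebars=False, skip_optional_prodnotes=False):
    """Split a container's children into (read_now, deferred) lists.

    Deferred items are not discarded: they stay available as targets the
    reader can return to. A notice is always read, even when sidebars
    are switched off.
    """
    read_now, deferred = [], []
    for el in container:
        # Collapse whitespace in the element's full text content.
        text = " ".join("".join(el.itertext()).split())
        skippable = (
            (el.tag == "sidebar" and skip_sidebars)
            or (
                el.tag == "prodnote"
                and el.get("render") == "optional"
                and skip_optional_prodnotes
            )
        )
        (deferred if skippable else read_now).append((el.tag, text))
    return read_now, deferred


read_now, deferred = reading_plan(
    ET.fromstring(SAMPLE), skip_sidebars=True, skip_optional_prodnotes=True
)
```

With both options switched on, the sidebar and the optional producer note land in the deferred list, while the notice and the required producer note remain in the reading order.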

Local Navigation

Local navigation (one could also call this function simply “reading”) comprises movement within a single text element such as a list or table, or within a narrow range of text elements such as a group of words, sentences or paragraphs. Reading is an interaction between the reader and the content which involves the reader absorbing letters, words, sentences, paragraphs, list items, table cells, etc. while controlling the rate and making instant adjustments to reread a word or sentence, or choosing to jump to the next paragraph, capture the paragraph’s topic and move on. The reading system for a DTB can not only enable this functionality, but make the entire process completely natural and fluid.

The global navigation described earlier is enabled through the functionality of the NCX. Local navigation/reading is made possible by the XML encoding of the text. Whether a DTB contains a full multimedia presentation or only text, the XML-encoded text provides the local reading control. In a SMIL DTB this is accomplished through bidirectional references: each XML element carries a smilRef that points to a par of the SMIL presentation, and that par in turn points back to the XML text element. While the SMIL controls the simultaneous presentation of audio, text, and images (other media are possible), decisions by the reader can change the position of the SMIL presentation down to the level of granularity the presentation allows. In other words, if the reader chooses to move the current focus to the next element at the same level, the SMIL presentation transitions to the par whose URI matches that of the target element. For example, if the current element is a block-level element such as a paragraph and the reader moves to the next paragraph, the URI in the smilRef of the target paragraph is selected and passed to the SMIL player for synchronization.

If the DTB incorporates sentence- or word-level synchronization and the reader elects to move to the next element, then the system moves to the next inline element and passes its URI to the SMIL player. The content of the smilRef in the XML-encoded text is the controlling mechanism for local navigation through the SMIL presentation.
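The "move to the next element at the same level" step can be sketched in a few lines of Python. The fragment below is illustrative only: the URI form (file name plus fragment identifier), the identifiers, and the function name next_at_same_level are assumptions, and handing the result to a SMIL player is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the smilRef values are invented URIs.
SAMPLE = """
<level1>
  <h1 smilRef="ch1.smil#par_h1">Chapter 1</h1>
  <p smilRef="ch1.smil#par_p1">
    <sent smilRef="ch1.smil#par_s1">First sentence.</sent>
    <sent smilRef="ch1.smil#par_s2">Second sentence.</sent>
  </p>
  <p smilRef="ch1.smil#par_p2">Second paragraph.</p>
</level1>
"""


def next_at_same_level(root, current):
    """Return the smilRef of the next sibling of `current`, or None.

    The returned URI is what would be passed to the SMIL player so that
    audio and text stay synchronized after the move.
    """
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(current)
    if parent is None:
        return None
    siblings = list(parent)
    for el in siblings[siblings.index(current) + 1:]:
        ref = el.get("smilRef")
        if ref:
            return ref
    return None


root = ET.fromstring(SAMPLE)
first_p = root.find("p")
first_sent = first_p.find("sent")

paragraph_move = next_at_same_level(root, first_p)     # paragraph granularity
sentence_move = next_at_same_level(root, first_sent)   # sentence granularity
```

The same function serves both granularities: what changes is only which element currently has the focus, a paragraph or a sentence inside it.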

The DTBook DTD contains all of the elements necessary for complete user control of reading. The block-level elements are mostly borrowed from HTML 4.0. For inline applications, the common elements needed for a rich reading experience were assembled. Some elements, such as word, were added to the DTD to provide fine-grained synchronization with the SMIL presentation. Similarly, the element sent was introduced for sentence-level synchronization. Word-level synchronization will normally be accomplished through software that examines the text file and recognizes the corresponding words in the audio stream. The bidirectional references and the positions in audio space will be set by the software.

Further details on navigation in tables, lists, etc. can be found in the Document Navigation Features List [Navigation Features].

Transition Between Global and Local Navigation

The combination of global and local navigation allows the reader to move efficiently through a digital talking book, adjusting the granularity of individual jumps appropriately. These two functions are defined separately because they are accomplished in distinctly different ways: global navigation through the NCX, and local navigation through the structure of the XML source document. Both navigation capabilities exist simultaneously. One can always invoke global navigation and move to a different destination in the NCX; and because the destination is itself a single component of the text being read, local reading control or navigation is also present once one has arrived at the target. For example, while reading mid-paragraph, the reader can elect to go to the next heading (global navigation). Alternatively, the reader may elect to move to the next paragraph or list item (local navigation). Both local and global navigation are concurrently possible in a DTB; in that sense the NCX is “omnipresent”. This is a very powerful concept, and it also distinguishes a reading system that is simply a SMIL player using the NCX from a comprehensive reading system that complies strictly with this standard.
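The coexistence of the two kinds of movement can be sketched as a single reading cursor that accepts either command at any time. This is a toy model under stated assumptions: the Reader class, its method names, the id values, and the use of a plain list of heading positions as a stand-in for the NCX are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the id values are invented.
SAMPLE = """
<book>
  <level1>
    <h1 id="h_1">Chapter 1</h1>
    <p id="p_1">First paragraph.</p>
    <p id="p_2">Second paragraph.</p>
  </level1>
  <level1>
    <h1 id="h_2">Chapter 2</h1>
    <p id="p_3">Third paragraph.</p>
  </level1>
</book>
"""

HEADINGS = {f"h{n}" for n in range(1, 7)}
CONTAINERS = {"book", "level"} | {f"level{n}" for n in range(1, 7)}


class Reader:
    """Toy reading cursor offering local and global moves side by side."""

    def __init__(self, root):
        # Reading order: content elements in document order, containers skipped.
        self.order = [el for el in root.iter() if el.tag not in CONTAINERS]
        # Stand-in for the NCX: positions of the heading targets.
        self.nav_points = [
            i for i, el in enumerate(self.order) if el.tag in HEADINGS
        ]
        self.pos = 0

    def current(self):
        return self.order[self.pos].get("id")

    def next_local(self):
        """Local navigation: step to the next element in reading order."""
        if self.pos + 1 < len(self.order):
            self.pos += 1
        return self.current()

    def next_heading(self):
        """Global navigation: jump to the next heading target, if any."""
        for i in self.nav_points:
            if i > self.pos:
                self.pos = i
                break
        return self.current()


reader = Reader(ET.fromstring(SAMPLE))
```

Starting at the first heading, a call to next_local() moves to the first paragraph, and a call to next_heading() from there jumps straight to the second chapter, after which local stepping resumes from the new position.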