Standards and Technologies

Standards

There are few established standards in this anarchic area at present. EPUB3 is a developing standard for electronic publishing that will require compliant publications to make provision for workable linked or embedded indexes. Details and progress reports can be obtained from the Indexes Working Group of the IDPF (International Digital Publishing Forum).

There is as yet no universally recognised standard for embedded indexing. Most word processing, desktop publishing and authoring software tools incorporate some kind of indexing module, and each seems to observe its own rules. But there does seem to be an emerging de facto standard for XML embedding, and that is the DocBook system, regulated by the OASIS Consortium.
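As a rough illustration of XML embedding, DocBook marks index entries with 'indexterm' elements containing 'primary' and optional 'secondary' headings, placed in the text at the point of discussion. The sketch below reads such entries from a small fragment using Python's standard library. The fragment, the function name and the sample wording are invented for illustration, and the DocBook namespace that full DocBook 5 documents declare is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Invented sample text; namespace-free for brevity (an assumption of this sketch).
FRAGMENT = """
<para>Embedded indexing<indexterm><primary>indexing</primary>
<secondary>embedded</secondary></indexterm> stores the terms in the text
itself.<indexterm><primary>hidden text</primary></indexterm></para>
"""

def index_terms(xml_text):
    """Return (primary, secondary) pairs in document order."""
    root = ET.fromstring(xml_text.strip())
    terms = []
    for node in root.iter("indexterm"):
        primary = node.findtext("primary", default="")
        secondary = node.findtext("secondary")  # None when there is no subheading
        terms.append((primary, secondary))
    return terms

terms = index_terms(FRAGMENT)
for primary, secondary in terms:
    print(primary if secondary is None else f"{primary}: {secondary}")
```

The entries travel with the paragraph that contains them, which is the property that distinguishes embedding from a separately maintained index file.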

Technologies

For an overview of the current impact of new technology on indexing, a good, accessible starting point is the March 2012 issue of The Indexer, available as hard copy to Society of Indexers members and online via The Indexer website. Society of Indexers members should log in following the instructions in the members' area of the Society website.

A huge number of prognostications have been published about emerging publishing technologies, ranging from confident claims that the codex is doomed to equally assured ones that ebooks will be rejected by the public within a decade. We intend to collect those whose authors show at least some appreciation of, for example, the retrieval problems presented by ebooks; at present this is a small subset. We should also like to add critical reviews by indexers of books and articles on new technology written by non-indexers, and any especially valuable material drawn from discussion lists. Meanwhile, the distinct areas of tagging and embedded indexing, and the significance of HTML and XML, have been reviewed by:

- James Lamb on his blog ('Future publishing technologies and indexes', February 2011)
- Bill Johncocks in The Indexer ('New technology and public perceptions', The Indexer, 30(1), 6-10, March 2012)

SI members interested in embedded indexing training should take a look at the MS Word Embedded Indexing Online Workshop and the short video introduction to embedded indexing.

Also of interest is Kevin Broccoli's Berkeley Extension course on Embedded Indexing and Indexing of Ebooks.

There are currently three important technologies in the area of device-neutral electronic publishing, each of which facilitates the linking of an index entry to the point at which its topic is discussed in the unpaginated, flowable text. They are embedded indexing, hyperlinking, and tagging. Here are a few words on how they interrelate, with links to more detail:

Embedded indexing
This technique was originally designed for text that would eventually be paginated; usually it simply moved the indexing operation so it could be completed before the page layout stage, saving time. Embedded indexing (EI) involves inserting actual index terms at appropriate locations in the electronic version of the indexed document, in a form whose visibility can be switched on or off.

After the document is paginated, the EI software builds the index automatically from this markup by associating the final page locations with each embedded index entry. It also performs the sorting, suppression of duplication and formatting necessary to produce a usable index with no further involvement by the indexer.
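The build step just described can be sketched in a few lines. This is a minimal illustration, not the behaviour of any particular product: the '{{idx:...}}' marker syntax, the function names and the sample pages are all invented here, and real EI software also handles subheadings, page ranges and cross-references.

```python
import re
from collections import defaultdict

# Hypothetical marker syntax for an embedded entry, e.g. {{idx:cats}}
MARKER = re.compile(r"\{\{idx:([^}]*)\}\}")

def build_index(pages):
    """Collect embedded entries, attach final page numbers, de-duplicate, sort."""
    entries = defaultdict(set)  # a set suppresses duplicate locators
    for page_number, page_text in enumerate(pages, start=1):
        for match in MARKER.finditer(page_text):
            entries[match.group(1).strip()].add(page_number)
    ordered = sorted(entries.items(), key=lambda item: item[0].lower())
    return [(term, sorted(numbers)) for term, numbers in ordered]

def strip_markers(page_text):
    """Hide the embedded entries, leaving only the readable text."""
    return MARKER.sub("", page_text)

# Sample pages as they might stand after pagination (invented text).
pages = [
    "Cats{{idx:cats}} are popular pets. Dogs{{idx:dogs}} are too.",
    "More about cats{{idx:cats}} and their diet{{idx:cats}}.",
    "Training dogs{{idx:dogs}} takes patience.",
]

index = build_index(pages)
for term, numbers in index:
    print(f"{term}, {', '.join(str(n) for n in numbers)}")
```

The two entries for cats on the second page collapse into a single locator, and 'strip_markers' shows how the same text reads with the embedded entries switched off.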

The embedded index has two essential features, which are not always appreciated. First, the indexing exists as hidden text, which means its display can be turned on and off. Second, there is no separate index file: the index is wholly contained in, and retained with, the modified original document.

EI overcame many drawbacks of the traditional page-based back-of-the-book index:
- First (depending on the client's software choices), indexing of the author's manuscript can be performed at an early stage, as soon as the text is reasonably stable. Then, after pagination, simply running the software generates a finished index. Human indexing need no longer carry an opportunity cost by holding up the production schedule.
- Second, the indexed text will be tolerant of subsequent deletions, format changes and content rearrangements, provided they are carried out before the final, index-generating software run. The affected index terms will simply disappear (in the case of deletion) or move to new locations (in the case of rearrangements) along with the portion of text they describe. In an extreme case, lose half the chapters and you'll be left with only about half the index terms: just those describing the surviving text. The resulting index may be unbalanced and might look rather odd to a trained evaluator, but it will still point to the right sections of text.

In principle, although word processor-based EI software defaults to allocating page numbers, the embedding would be equally valid for display on a website or an ebook page: it's just a question of adding further software to render the resulting index appropriately. Thus embedded indexes facilitate repurposing: index once; publish many times!
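The tolerance to rearrangement and deletion follows directly from the entries living inside the text. The sketch below treats each chapter as one 'page' and regenerates the locators after chapters are reordered or cut. The marker syntax, function name and chapter texts are assumptions made for this example only.

```python
import re

MARKER = re.compile(r"\{\{idx:([^}]*)\}\}")  # hypothetical marker syntax

def locate(chapters):
    """Return {term: [chapter positions]} from the entries embedded in each chapter."""
    found = {}
    for position, text in enumerate(chapters, start=1):
        for term in MARKER.findall(text):
            positions = found.setdefault(term, [])
            if position not in positions:
                positions.append(position)
    return found

chapter_a = "Soil preparation{{idx:soil}} comes first."
chapter_b = "Pruning roses{{idx:roses}} in spring."
chapter_c = "Composting{{idx:compost}} improves soil{{idx:soil}}."

original = locate([chapter_a, chapter_b, chapter_c])
rearranged = locate([chapter_c, chapter_a, chapter_b])  # chapters reordered
cut = locate([chapter_a, chapter_c])                     # chapter_b deleted

print(original)
print(rearranged)
print(cut)
```

After the reordering every term still points at the chapter that discusses it, and after the cut the entry for roses has gone along with its chapter, with no editing of the index itself.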

There is more on EI generally in James Lamb's blog and on XML embedding specifically in Michele Combs' article in The Indexer ('XML indexing', The Indexer, 30(1), 47-52, March 2012).

Hyperlinking
Hyperlinking is familiar to us all because it provides the connectivity of the Web. Being electronic, it is of course not applicable to hard-copy indexes, but it's great for ebooks. It works in a different way from EI (discussed above) and tagging (below). Instead of inserting the index terms or a placeholder code, we can use the same markup language that renders the document to link the index to the text treatment. In the case of a web page, this will normally use two instances of the HTML 'anchor' tag, linked by matching attribute values at each end.

The rendering software needs to match the 'href' attribute of the clickable end of the link (the index term) to the 'name' attribute of the link target (the labelled text location), so each target needs a unique identifier; the 'href' value is that identifier preceded by '#'. (Current HTML uses the 'id' attribute for link targets in place of 'name' on anchors.) An entry can be set up to take the index user straight to a single text location, or it can carry several links, each with its own matching identifier, leading to different discussions in the text. Clicking on a link causes the relevant text to appear instantly at the top of the reader's screen.
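The anchor pairing can be illustrated by generating both ends of the link. This is only a sketch: the identifiers, function names and sample sentence are invented, and the target uses the 'id' attribute, which current HTML prefers to the older 'name' attribute on anchors.

```python
from html import escape

def link_target(anchor_id):
    """Invisible anchor placed in the text where the topic is discussed."""
    return f'<a id="{escape(anchor_id, quote=True)}"></a>'

def index_entry(term, anchor_ids):
    """Index line whose numbered links each jump to one text location."""
    links = ", ".join(
        f'<a href="#{escape(anchor_id, quote=True)}">{number}</a>'
        for number, anchor_id in enumerate(anchor_ids, start=1)
    )
    return f"<li>{escape(term)}: {links}</li>"

# Hypothetical identifiers; any scheme works provided each is unique in the document.
body = f'<p>{link_target("ix-kindle-1")}The Kindle uses generated locations.</p>'
entry = index_entry("Kindle", ["ix-kindle-1", "ix-kindle-2"])

print(body)
print(entry)
```

Each 'href' value is the target's identifier preceded by '#', so an entry with two discussions simply carries two links with two different identifiers.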

Website indexing is a specialized area of activity at present, but potentially significant because ebooks are structurally similar to very large web pages. Each is a continuous stream of text, with no predetermined structure or breaks, through which the reader slowly scrolls, and current ereader systems employ coding based on HTML or XHTML to render the book in reader-specified screen sizes.

Because ebooks have no fixed pagination, any page-based indexes prepared for their hard-copy versions will self-evidently be inappropriate. Nevertheless, these are often supplied, complete with meaningless locators. Whether good or bad, ebook indexes are not always easily accessible. The fall-back (at least in the case of the Amazon Kindle, currently dominant in the UK market) seems to be keyword searching linked to automatically generated 'locations' of perhaps a sentence or two, which is almost as precise as hyperlinking. The locations usually work well; it's the retrieval that fails the reader in the case of academic material, textbooks and all serious non-fiction.

More information on hyperlinking ebooks is in James Lamb's blog.

Tagging
Since EI appeared, publishers have developed several alternative options for indexing unpaginated documents. Three major companies, CUP, OUP and Elsevier, use a variant of what - in the UK at least - is loosely called 'tagging', an unfortunate term considering the specific use of the word 'tag' in markup languages. Unlike EI, tagging does involve supplying a separate index file but it still avoids page numbers. Instead it relies on inserting some kind of unique code (though not the index terms themselves) into the text at the places where indexed topics are discussed. These codes act as placeholders - temporary locators - in the accompanying index and are replaced (in the case of hard-copy publications) once page numbers have been fixed.
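The placeholder mechanism can be sketched as follows. This is an assumed, simplified workflow rather than any publisher's actual system: the '[[code]]' syntax, the function names and the sample text are all hypothetical.

```python
import re

CODE = re.compile(r"\[\[(\w+)\]\]")  # hypothetical placeholder syntax, e.g. [[a7]]

def page_of_each_code(pages):
    """Find the page on which each placeholder code ends up after pagination."""
    return {
        code: page_number
        for page_number, text in enumerate(pages, start=1)
        for code in CODE.findall(text)
    }

def resolve(index_lines, code_pages):
    """Replace the temporary locators in the separate index file with page numbers."""
    return [
        CODE.sub(lambda match: str(code_pages[match.group(1)]), line)
        for line in index_lines
    ]

# The text carries only codes; the index terms live in a separate file.
pages = [
    "[[a7]]Embedded indexing stores terms in the text.",
    "[[k2]]Tagging stores only codes. [[q9]]Hyperlinks use anchors.",
]
index_file = [
    "embedded indexing, [[a7]]",
    "hyperlinking, [[q9]]",
    "tagging, [[k2]]",
]

resolved = resolve(index_file, page_of_each_code(pages))
print("\n".join(resolved))
```

In contrast to embedded indexing, deleting a passage here removes the code from the text but leaves its entry behind in the index file, so the two have to be reconciled before the locators are resolved.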

Positions in the text can be marked in two different ways: with either a meaningful sequence number or an arbitrary code. The only requirement of any location code is that it should be unique in the text (and preferably able to be hidden). Sequence numbering is probably faster, because a computer can assign section, paragraph, line or sentence numbers in advance, but it depends on a stable, definitive text. Arbitrary codes have to be assigned by the indexer (as do some EI extent designations that will result in page ranges) but, again like EI, because they have no positional significance they will survive subsequent rearrangement of text elements, such as swapping two sections around, which would undermine any sequence numbers.
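The difference between the two kinds of locator shows up as soon as two sections are swapped. In this sketch the section texts, the codes and the function names are invented for illustration.

```python
def by_sequence(section_texts):
    """Locator = position of the section in the text (a meaningful sequence number)."""
    return {position: text for position, text in enumerate(section_texts, start=1)}

def by_code(tagged_sections):
    """Locator = arbitrary code that travels with the section itself."""
    return {code: text for code, text in tagged_sections}

# Each section is a (hypothetical arbitrary code, text) pair.
sections = [("x41", "On soil"), ("p07", "On roses"), ("m93", "On compost")]

index_by_sequence = {"roses": 2}    # second section when the index was made
index_by_code = {"roses": "p07"}    # code attached to the roses section

# Swap the first two sections after the index has been prepared.
swapped = [sections[1], sections[0], sections[2]]

sequence_before = by_sequence([text for _, text in sections])[index_by_sequence["roses"]]
sequence_after = by_sequence([text for _, text in swapped])[index_by_sequence["roses"]]
code_after = by_code(swapped)[index_by_code["roses"]]

print(sequence_before, "|", sequence_after, "|", code_after)
```

After the swap the sequence-number locator leads to the wrong section, while the arbitrary code still finds the discussion of roses.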

There's more elsewhere about the crucial distinction between EI and tagging.