The mysterious disappearance of "".

by Mathra

Disappearance of &shy would have made so much better a title, but it is not &shy that has disappeared. The entity &shy (a.k.a. &#173 or &#xAD) is still with us, and it serves well to remind us of a browser bug that appears to have achieved standards-compliance. What has disappeared is the glyph that is associated with &#173: a hyphen in the Latin character set. But that is not the bug.

Let us begin with the fragment,

"rub-a-dub-dub".

A quick scan of the first half-dozen pages of Google hits for that turned up about equal numbers of "rub-a-dub-dub"s and "rub a dub dub"s, one "rub.a.dub.dub.com", and no "rubadubdub"s. (A search for "rubadubdub" turned up quite a few pages of those too.) Discounting the ones with " "s and "."s, we have eight different four-syllable fragments and at least 64 outcomes for browsers to sort out. The extreme cases are:

"rub-a-dub-dub" : never break this word after a "-".
"rub-a-dub-dub" : if appropriate, break this word after a "-".
"rubadubdub" : never break this word between syllables.
"rubadubdub" : if appropriate, break this word between syllables.

In the last case the desired outcome has a "-" before the break, but it is not clear whether that "-" should be associated with &#45 or &#173. If it is the latter that was intended, we'll just have to find the glyph for &#173 and put it back before the hyphenation bug can be fixed.
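
The counts quoted above can be checked with a short enumeration. The sketch below is in Python and rests on one assumed reading of the figures: each of the three joins between syllables either shows a "-" or shows nothing (eight spellings), and each join independently either permits a line break or forbids one (four options per join, hence 64). The names `fragments` and `outcomes` are made up for the illustration.

```python
from itertools import product

SYLLABLES = ["rub", "a", "dub", "dub"]

# Assumption: each join between syllables either shows a hyphen or nothing.
GLYPHS = ["-", ""]
# Assumption: independently, each join either permits a line break or not.
BREAKS = [True, False]

def fragments(syllables):
    """Return every visible spelling of the word, one per choice of glyphs."""
    joins = len(syllables) - 1
    result = []
    for glyphs in product(GLYPHS, repeat=joins):
        word = syllables[0]
        for glyph, syllable in zip(glyphs, syllables[1:]):
            word += glyph + syllable
        result.append(word)
    return result

def outcomes(syllables):
    """Return every (glyph, break-allowed) assignment across the joins."""
    joins = len(syllables) - 1
    per_join = list(product(GLYPHS, BREAKS))  # four options per join
    return list(product(per_join, repeat=joins))

words = fragments(SYLLABLES)
print(len(words))                # number of spellings, 2 ** 3
print(len(outcomes(SYLLABLES)))  # number of outcomes, 4 ** 3
```

The four extreme cases listed earlier are the corners of this space: all hyphens or none, combined with all breaks allowed or none.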

The purpose of the examples was merely to indicate that hyphenation is a process that involves rules and logic: in HTML, it is markup. Clues to why or how &#173's glyph disappeared can be found in an examination of the W3C specifications for HTML. Here is some of what appears in HTML 2:

Data Characters

Any sequence of characters that do not constitute markup (see 9.6 "Delimiter Recognition" of [SGML]) are mapped directly to strings of data characters. Some markup also maps to data character strings. Numeric character references map to single-character strings, via the document character set. Each reference to one of the general entities defined in the HTML DTD maps to a single-character string.

Evidently, markup was markup, and characters that weren't part of markup were usually glyphs, back in the old days. A little more info appears in

        REFERENCE       DESCRIPTION
        --------------  -----------
        &#00; - &#08;   Unused
        &#09;           Horizontal tab
        &#10;           Line feed
        &#11; - &#12;   Unused
        &#13;           Carriage Return
        &#14; - &#31;   Unused
        --------------  -----------
        &#173;          Soft hyphen
        --------------  -----------

At that time, it was intended that the glyphless characters, &#09, &#10 and &#13, would be used only in the PRE, XMP and LISTING elements, and nothing was said about &#173. But note the name of &#173, Soft hyphen: that, I assert, is the bug that has crept in among the vacuum tubes, and it probably crept in long before the writing of HTML 2.

Later, in HTML 3.2, a few of the references to "character" are:

PRE preformatted text
Requires start and end tags. These elements are rendered with a monospaced font and preserve layout defined by whitespace and line break characters.

Character Entities for ISO Latin-1
 
     - - - - - - - - - - - - - - -

<!ENTITY not    CDATA "&#172;" -- not sign -->

<!ENTITY shy    CDATA "&#173;" -- soft hyphen -->

<!ENTITY reg    CDATA "&#174;" -- registered sign -->
     - - - - - - - - - - - - - - -

Here, &#173 is still associated with CDATA in the standards, but it is not clear whether its glyph should be rendered or not in implementations. The bug has probably died by now, but the unexpected malfunctions haven't started happening yet.

Finally, in HTML 4.01, most of the info about line-breaking and hyphenation (outside of PRE, XMP and LISTING elements) is:

The BR element forcibly breaks (ends) the current line of text.

Prohibiting a line break

Sometimes authors may want to prevent a line break from occurring between two words. The &nbsp; entity (&#160; or &#xA0;) acts as a space where user agents should not cause a line break.

9.3.3 Hyphenation

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;).

The bug is dead; long live the bug.
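
The semantics quoted from §9.3.3 are compact enough to sketch. What follows is a minimal illustration in Python, not a description of any browser's code: it assumes one column per character, and the helper names `visible`, `break_word` and `search_key` are hypothetical.

```python
SHY = "\u00ad"  # soft hyphen, &#173

def visible(text):
    """Text as shown on one unbroken line: soft hyphens display nothing."""
    return text.replace(SHY, "")

def break_word(word, room):
    """Split one word at the last soft hyphen that fits in `room` columns.

    Returns (head, tail). When a break is made, head ends with a displayed
    "-" and tail keeps its remaining soft hyphens. When no soft hyphen
    fits, head is "" and tail is the whole word.
    """
    best = -1
    for i, ch in enumerate(word):
        if ch == SHY:
            # Visible width before this point, plus one column for the hyphen.
            if len(visible(word[:i])) + 1 <= room:
                best = i
    if best < 0:
        return "", word
    return visible(word[:best]) + "-", word[best + 1:]

def search_key(text):
    """Key for searching and sorting: soft hyphens are ignored."""
    return visible(text)

word = "rub" + SHY + "a" + SHY + "dub" + SHY + "dub"
head, tail = break_word(word, 5)
print(head)           # first line, ending in a displayed hyphen
print(visible(tail))  # remainder, with no hyphen displayed
```

Note that the "-" at the end of the first line is produced by the renderer at the moment of breaking. The character &#173 itself never displays anything in this model, which is exactly the disappearance the title complains about.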

What appears to have happened is somewhat like synecdoche, but not quite the same. Reference to the whole by the name of a part (all hands on deck) is usually an entertaining figure of speech, and it is usually done on purpose. The use of soft hyphen to refer to hyphenation is merely soft thought: a bit of linguistic driftwood that has washed ashore so slowly that nobody could be blamed for failing to notice it. It is an accident that has placed an intolerable burden on poor little &shy, and it is no wonder that it hardly ever shows itself any more. Roughly stated, the argument has been that authors who wish to anticipate the imminent implementation of hyphenation with &#173 would not like to see all those unwanted hyphens that wouldn't go away until the task was completed. (HTML 4.0 appeared about six-and-a-half years ago.)

Consider something like "ruba-dubdub" again: the kinds of outcome an author might like to specify include:


 ruba-dubdub           no breaks allowed anywhere
 rub<?hb>a-dubdub      allow (hyphen followed by break)
 ruba<h?b>dubdub       hyphen followed by (allow break)
 ruba-dub<?hb>dub      allow (hyphen followed by break)

Note the simplicity that follows from the introduction of the two bogus tags: after all, hyphenation is markup. Of course there are other ways to do the tasks, such as:



 » <h?b> => -&#x200B    a workaround that might work
 » <?hb> => &#173       but cf. &#09, &#10 and &#13
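
The two substitutions above amount to a one-line preprocessing step. The sketch below, in Python, only illustrates that mapping: the function name `expand` is hypothetical, and whether browsers then break where intended is the open question already noted.

```python
ZWSP = "\u200b"  # zero-width space, &#x200B
SHY = "\u00ad"   # soft hyphen, &#173

def expand(marked):
    """Replace the two bogus tags with the workaround characters.

    <h?b>  becomes a plain hyphen followed by a zero-width space
           (hyphen always shown, break allowed after it).
    <?hb>  becomes a soft hyphen
           (hyphen and break appear together or not at all).
    """
    return marked.replace("<h?b>", "-" + ZWSP).replace("<?hb>", SHY)

for sample in ["ruba-dubdub", "rub<?hb>a-dubdub", "ruba<h?b>dubdub"]:
    # repr() makes the otherwise invisible characters visible as escapes.
    print(repr(expand(sample)))
```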

These are my suggestions:

» W3C should rewrite §9.3.3 of HTML 4, and they should give &#173 a better break.

» Purveyors of browsers should put &#173's glyph back.

» W3C should consider outlining standards for hyphenation-as-markup.

» Purveyors should consider implementing the enhancement to HTML.

And the implementations shouldn't be difficult! The code that decides when a break is allowed was done long ago. Browsers all seem to be able to ignore <i>, <b> and other tags within words when asked to "find in this page", but they do differ in the treatment of things like "some<br>thing". That would be a matter where a recommended standard would help. My intuitive feeling is that a search for "something" should never find "some<br>thing", but I am still undecided about whether a search for "some-thing" should also find "some thing", "some.thing", and perhaps others. As for hyphenation in PRE, XMP or LISTING elements, the tags could be interpreted in PRE, but not in the others, which would be a useful distinction.
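
The "find in this page" behaviour I have in mind can be sketched as follows. This is a Python illustration of the preference stated above, not of what any browser does: the set of tags treated as invisible inside words is an assumption, and the names `SearchText`, `searchable` and `found` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags are ignored when they occur inside a word.
INLINE = {"i", "b", "em", "strong"}

class SearchText(HTMLParser):
    """Collect page text, treating every non-inline tag as a word boundary."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE:
            self.parts.append(" ")  # e.g. <br> separates words

    def handle_endtag(self, tag):
        if tag not in INLINE:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)

def searchable(markup):
    """Return the text a search should see, with whitespace collapsed."""
    parser = SearchText()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())

def found(needle, markup):
    """True if `needle` occurs in the searchable text of `markup`."""
    return needle in searchable(markup)

print(found("something", "some<i>th</i>ing"))  # inline tags are ignored
print(found("something", "some<br>thing"))     # <br> splits the word
```

The undecided cases ("some-thing" against "some thing" or "some.thing") are deliberately left out: they would need a policy, not just a parser.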

About automatic hyphenation, be it done with dictionaries (too big) or by algorithms (too dicey), perhaps the less said the better. If it were to be introduced, however, it should be treated as a matter of style: authors should have directives that specify whether things like autohyph are on or off, and users should have clear and simple ways to override authors' directives. (Maybe finding variants of "some-thing" could be treated as style too. Google:on, anyone?)

The mysterious disappearance of "­".

"rub-a-dub-dub".

The mysterious disappearance of "".