The mysterious disappearance of "".
by Mathra
Disappearance of &shy; would have made so much better a title, but it is not
&shy; that has disappeared. The entity, &shy; (a.k.a. &#173 or &#xAD), is still
with us, and it serves well to
remind us of a browser-bug that appears to have achieved standards-compliance.
What has disappeared is the glyph that is associated with &#173: a hyphen
in the Latin character set. But that is not the bug.
Let us begin with the fragment,
"rub-a-dub-dub".
A quick scan of the first half-dozen pages of Google hits for that turned up about
equal numbers of "rub-a-dub-dub"s and "rub a dub dub"s, one "rub.a.dub.dub.com", and no
"rubadubdub"s. (A search for "rubadubdub" turned up quite a few pages of those too.)
Discounting the ones with " "s and "."s, we have eight different four-syllable fragments and
at least 64 outcomes for browsers to sort out. The extreme cases are:
- "rub-a-dub-dub" : never break this word after a "-".
- "rub-a-dub-dub" : if appropriate, break this word after a "-".
- "rubadubdub" : never break this word between syllables.
- "rubadubdub" : if appropriate, break this word between syllables.
In the last case the desired outcome has a "-" before the break, but it is not clear whether
that "-" should be associated with &#45 or &#173. If it is the latter that was
intended, we'll just have to find the glyph for &#173 and put it back before the
hyphenation bug can be fixed.
The purpose of the examples was merely to indicate that hyphenation is a process
that involves rules and logic: in HTML, it is markup.
Clues to why or how &#173's glyph disappeared can be found in an examination of the W3C
specifications for HTML. Here is some of what appears in HTML 2:
Data Characters
Any sequence of characters that do not constitute markup (see 9.6 "Delimiter
Recognition" of [SGML]) are mapped directly to strings of data characters. Some
markup also maps to data character strings. Numeric character references map to
single-character strings, via the document character set. Each reference to one of the
general entities defined in the HTML DTD maps to a single-character string.
Evidently, markup was markup, and characters that weren't part of markup
were usually glyphs, back in the old days. A little more info appears in
the table of numeric character references:
REFERENCE       DESCRIPTION
--------------  -----------
&#00; - &#08;   Unused
&#09;           Horizontal tab
&#10;           Line feed
&#11; - &#12;   Unused
&#13;           Carriage Return
&#14; - &#31;   Unused
--------------  -----------
&#173;          Soft hyphen
--------------  -----------
At that time, it was intended that the glyphless characters, &#09;, &#10; and &#13;,
would be used only in the PRE, XMP and LISTING elements, and nothing was said about &#173;.
But note the name of &#173;, Soft hyphen: that,
I assert, is the bug that
has crept in among the vacuum tubes, and it probably crept in long before the
writing of HTML 2.
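The name is still attached to the character today, and that is easy to check. The following is only a small sketch in Python (my choice of language, not anything the HTML specifications prescribe), using nothing beyond the standard library. It decodes the three spellings of the reference and asks the Unicode database what the resulting character is called.

```python
import html
import unicodedata

# All three spellings of the reference decode to the same single character.
soft_hyphen = html.unescape("&#173;")
same_char = soft_hyphen == html.unescape("&shy;") == html.unescape("&#xAD;")

print(same_char)                          # True
print(len(soft_hyphen))                   # 1
print(hex(ord(soft_hyphen)))              # 0xad
print(unicodedata.name(soft_hyphen))      # SOFT HYPHEN
print(unicodedata.category(soft_hyphen))  # Cf, a "format" character
```

The category is the interesting part: the character is classed with formatting controls, not with punctuation such as the plain hyphen.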
Later, in HTML 3.2, a few of the references to "character" are:
PRE preformatted text
Requires start and end tags. These elements are rendered with a monospaced font and preserve
layout defined by whitespace and line break characters.
Character Entities for ISO Latin-1
- - - - - - - - - - - - - - -
<!ENTITY not CDATA "&#172;" -- not sign -->
<!ENTITY shy CDATA "&#173;" -- soft hyphen -->
<!ENTITY reg CDATA "&#174;" -- registered sign -->
- - - - - - - - - - - - - - -
Here, &#173 is still associated with CDATA in the standards, but
it is not clear whether its glyph should be rendered or not in implementations.
The bug has probably died by now, but the unexpected malfunctions
haven't started happening yet.
Finally, in HTML 4.01, most of the info about line-breaking and
hyphenation (outside of PRE, XMP and LISTING elements) is:
The BR element forcibly breaks (ends) the current line of text.
Prohibiting a line break
Sometimes authors may want to prevent a line break from occurring between two words.
The &nbsp; entity (&#160; or &#xA0;) acts as a space where user agents should not
cause a line break.
9.3.3 Hyphenation
In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen
should be interpreted by a user agent as just another character. The soft hyphen tells the user agent
where a line break can occur.
Those browsers that interpret soft hyphens must observe the following semantics: If a line
is broken at a soft hyphen, a hyphen character must be displayed at the end of the first
line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen
character. For operations such as searching and sorting, the soft hyphen should always be
ignored.
In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is
represented by the character entity reference &shy; (&#173; or &#xAD;)
The bug is dead; long live the bug.
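To make the quoted semantics concrete, here is a minimal sketch of what §9.3.3 asks of a user agent, written in Python. It is an illustration under my own assumptions, not how any browser does it: the function names `wrap_soft` and `find_ignoring_shy` are invented for the example, line filling is a simple greedy fit measured in characters, and breaks are allowed only at spaces and at soft hyphens.

```python
SHY = "\u00ad"  # the soft hyphen character, &#173;


def wrap_soft(text, width):
    """Greedy line filling with breaks at spaces and at soft hyphens.

    A hyphen is shown only where a line is actually broken at a soft
    hyphen; soft hyphens that are not used for a break show nothing.
    """
    lines = []
    line = ""
    for word in text.split():
        parts = word.split(SHY)  # the pieces between soft hyphens
        while parts:
            prefix = line + " " if line else ""
            room = width - len(prefix)
            whole = "".join(parts)
            if len(whole) <= room:
                # The rest of the word fits: no break, so no hyphen.
                line, parts = prefix + whole, []
                continue
            # Largest number of leading pieces that fit with a hyphen.
            k = len(parts) - 1
            while k > 0 and len("".join(parts[:k])) + 1 > room:
                k -= 1
            if k > 0:
                line, parts = prefix + "".join(parts[:k]) + "-", parts[k:]
            elif not line:
                # Not even one piece fits on an empty line: let it overflow.
                hyphen = "-" if len(parts) > 1 else ""
                line, parts = parts[0] + hyphen, parts[1:]
            lines.append(line)
            line = ""
    if line:
        lines.append(line)
    return lines


def find_ignoring_shy(haystack, needle):
    """Search as the specification suggests: soft hyphens are ignored."""
    return needle.replace(SHY, "") in haystack.replace(SHY, "")


word = "rub" + SHY + "a" + SHY + "dub" + SHY + "dub"
print(wrap_soft(word, 20))                    # ['rubadubdub']
print(wrap_soft(word, 5))                     # ['ruba-', 'dub-', 'dub']
print(find_ignoring_shy(word, "rubadubdub"))  # True
```

Note that the hyphen shown at a break is supplied by the wrapping code. The soft hyphen character itself never displays anything here, which is exactly the behaviour the rest of this article complains about.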
What appears to have happened is somewhat like synecdoche, but not quite the same.
Reference to the whole by the name of a part (all hands on deck)
is usually an entertaining figure
of speech, and it is usually done on purpose. The use of soft hyphen to refer to
hyphenation is merely soft thought: a bit of linguistic driftwood that
has washed ashore so slowly that
nobody could be blamed for failing to notice it. It is an accident that has placed
an intolerable burden on poor little &shy;. No wonder it hardly ever shows itself
any more. Roughly stated, the argument has been that authors who wish to anticipate the imminent
implementation of hyphenation with &shy; would not like to see all those unwanted hyphens
that wouldn't go away until the task was completed. (HTML 4.0 appeared about
six-and-a-half years ago.)
Consider something like "ruba-dubdub" again: the kinds of outcome an author might like to
specify include:
- ruba-dubdub : no breaks allowed anywhere
- rub<?hb>a-dubdub : allow (hyphen followed by break)
- ruba<h?b>dubdub : hyphen followed by (allow break)
- ruba-dub<?hb>dub : allow (hyphen followed by break)
Note the simplicity that follows from the introduction of the two bogus tags:
after all, hyphenation is markup. Of course there are other ways to do the
tasks, such as:
» <h?b> => -​ a workaround that might work
» <?hb> => &shy; but cf. &#09;, &#10; and &#13;
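The two substitutions can be sketched as a tiny preprocessing step. This is only an illustration in Python: the function name `expand_bogus_tags` is invented, and I am assuming that the workaround for <h?b> is a plain hyphen followed by a zero-width space (U+200B), which is one way of allowing a break after a visible hyphen.

```python
SHY = "\u00ad"   # soft hyphen, &#173;
ZWSP = "\u200b"  # zero-width space, &#8203; (assumed workaround character)


def expand_bogus_tags(text):
    """Replace the two bogus tags with existing characters.

    <?hb>  allow (hyphen followed by break)  ->  soft hyphen
    <h?b>  hyphen followed by (allow break)  ->  hyphen + zero-width space
    """
    return text.replace("<?hb>", SHY).replace("<h?b>", "-" + ZWSP)


marked_up = "rub<?hb>a<h?b>dub<?hb>dub"
expanded = expand_bogus_tags(marked_up)
print(repr(expanded))  # 'rub\xada-\u200bdub\xaddub'
```

Whether a given browser honours either character is a separate question; the sketch shows only the mapping.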
These are my suggestions:
» W3C should rewrite §9.3.3 of HTML 4, and they should give &shy; a better break.
» Purveyors of browsers should put &#173's glyph back.
» W3C should consider outlining standards for hyphenation-as-markup.
» Purveyors should consider implementing the enhancement to HTML.
And the implementations shouldn't be difficult! The code that decides when a break is allowed
was done long ago. Browsers all seem to be able to ignore <i>, <b> and other tags
within words when
asked to "find in this page", but they do differ in the treatment of things like
"some<br>thing". That would be a matter where a recommended standard would help. My intuitive
feeling is that a search for "something" should never find "some<br>thing", but I am
still undecided about whether a search for "some-thing" should also find "some thing",
"some.thing", and perhaps others. As for
hyphenation in PRE, XMP or LISTING elements, the tags could be interpreted in PRE, but not in
the others: a useful distinction.
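The search behaviour I have in mind can be sketched in a few lines of Python with the standard `html.parser` module. Everything specific here is my own assumption for illustration: the class name `SearchableText`, the function `page_contains`, and the policy that only <br> separates words while every other tag is transparent.

```python
from html.parser import HTMLParser

SHY = "\u00ad"  # soft hyphen, ignored when searching


class SearchableText(HTMLParser):
    """Collect the text of a fragment the way "find in this page" might."""

    BREAKING = {"br"}  # assumed policy: these tags separate words

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tags such as <i> and <b> are skipped; <br> becomes a line break.
        if tag in self.BREAKING:
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def page_contains(markup, needle):
    parser = SearchableText()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.chunks).replace(SHY, "")
    return needle in text


print(page_contains("some<b>th</b>ing", "something"))  # True
print(page_contains("some<br>thing", "something"))     # False
print(page_contains("some&shy;thing", "something"))    # True
```

The undecided cases ("some-thing" versus "some thing" or "some.thing") would be a further normalisation step on `text`, and that is precisely the part where a recommended standard would help.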
About automatic hyphenation, be it done with dictionaries (too big) or by algorithms (too dicey), perhaps
the less said the better. If it were to be introduced, however, it should be
treated as a matter of style:
authors should have directives that specify whether things like
autohyph are on or off, and users
should have clear and simple ways to override authors' directives.
(Maybe finding variants of "some-thing" could be treated as style too.
Google:on, anyone?)