11.82 bpw clarification (was Re: ssh-keygen(1) misinfo: English prose entropy 0.6 - 1.3 b/char!)
auto92089 at hushmail.com
auto92089 at hushmail.com
Tue Jun 5 15:06:38 EST 2001
>Trolling is such fun, isn't it?
I was trying to making merriment, not to provoke emotional responses.
I am conflicted on publicly responding, but decided to clarify just in
case your confusion is shared by others with better impulse control.
Unfortunately, an unchallenged statement made in the company or
under the scrutiny of experts is often assumed true, which is really
the original problem I sought to fix!
I notice that the maintainers have quietly fixed the problem
in the web-accessible ssh-keygen manpage and presumably
latest code - thanks. One shudders at how many crypto-enthusiasts
have read that page and failed to notice or correct it!
>Entropy rate does not mean that
If you read the manpage which I quoted for your convenience,
it said "1-2 bits of entropy per word". Nowhere has anyone
mentioned "rate". Even so, a rate is the relation between two measurements,
and we could easily have bits per character or bits per word, depending
on the definition of "symbol" used. As such, my interpretation (and I suspect
many other people's) is that the manpage was describing the entropy of
an entire word (...given the word's preceeding context).
I suspect the original author of the manpage is from Finland also;
perhaps the confusion is due to some language difference.
>if you are shown an English word,
>character by character, how long does it take to guess the whole word?
>Try a couple of them:
>moro_
>idio_
>trol_
What you are alluding to is the conditional probability of the
(in your examples, last) character dependent on the preceeding
ones. While that is an interesting measurement, it's not the one
under discussion, and has units of "bits per character" (or letter).
http://cs.fit.edu/~mmahoney/dissertation/entropy1.html
"Refining the Estimated Entropy of English by Shannon Game Simulation"
A reasonable primer on Shannon's estimates of entropy.
http://www.stanford.edu/~vjsriniv/project/entropy_of_english_9.htm
Covers both of Shannon's methods for estimating H for English
(by character using n-grams, by word using Zipf's relations)
http://www.voynich.nu/wordent.html
"From digraph entropy to word entropy in the Voynich MS"
This is cited here because it relates bits per letter
and bits per word by the trivial relation of summing,
and because it's very cool. Note that for comparisons of
different lengths you need to pad each word with a
word-end symbol, the first of which may add
some entropy (as in "word" vs. "words").
There's a ton of interesting articles which pop up after searching
for "entropy" and "shannon" and ilk, but I must stop somewhere and
introduce the coup de grace, courtesy of some obscure man who
worked for the telephone company. Note that this is the independent
or 0-order entropy; I was unable to find a handy reference with contextual
word entropy estimates, but clearly is far from 1-2 bpw:
"Prediction and Entropy of Printed English",
by C.E. Shannon
Bell Syst. Tech. J. 30:50-64 1951
"...Using logarithmic scales for probability and rank, the curve
approximates a straight line with slope -1; this, if p_n is the
probability of the nth most frequent word, we have, roughly
p_n = 0.1/n .
Zipf has pointed out that this type of formula, p_n = k/n, gives a
rather good approximation to the word probabilities in many different
languages. The formula [above] clearly can not hold indefinitely since
the total probablility SUM p_n must be unity, while
SUM_1^inf 0.1/n
is infinite. If we assume (in the absence of any better estimate) that
the
formula
p_n = 0.1/n
holds out to the n at which the total probability is unity, and that
p_n = 0 for larger n; we find that the critical n is the word rank 8,727.
The entropy is then:
-SUM_1^8727 p_n log_2 p_n = 11.82 bits per word."
YHNBT. YHL. HAND.
I, Zone Lee
Free, encrypted, secure Web-based email at www.hushmail.com
More information about the openssh-unix-dev
mailing list