Adventures in lexical analysis of a selected set of Debian manpages, preliminary conclusions:
The third-most-prevalent 4gram is "1 author joey hess"
@joeyh, do you have anything to say for yourself?
1. Roughly 9,000 of 13,480 manpages lexilysed.
The other major research finding is that lexing a file at a time via associative arrays is vastly faster than catting all 76MB of the corpus: 23 minutes, vs a user-terminated process at 355 minutes, less than 50% complete.
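For anyone wanting to try the file-at-a-time approach, here is a minimal Python sketch. It swaps in Python's Counter for awk's associative arrays and is not the script behind the timings above. The tokenising regex and the names tally_file and tally_corpus are assumptions made for illustration.

```python
from collections import Counter
import re

def tally_file(text):
    """Count lowercased word tokens in one document's text."""
    # Assumed tokeniser: runs of letters, digits and apostrophes.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def tally_corpus(texts):
    """Tally each document on its own, then merge into one corpus-wide Counter."""
    total = Counter()
    for text in texts:
        total.update(tally_file(text))
    return total

# Two tiny stand-in "manpages"; a real run would read each file's text instead.
docs = ["The file is read.", "Read the file, then the next file."]
counts = tally_corpus(docs)
print(counts.most_common(3))
```

Each per-file tally stays small and is discarded after merging, so only the corpus-wide table grows.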
There are 10,557,717 words in the set.
137,435 of those are distinct across the corpus.
(Lowercased, other normalisation applied, may include numbers.)
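A rough Python sketch of how total and distinct counts like these can be derived. The exact normalisation used for the numbers above isn't spelled out, so the rule here (lowercase, keep letters and digits, allow one apostrophe suffix) is only an assumption, as is the name normalise.

```python
from collections import Counter
import re

# Assumed normalisation: letters/digits, optionally followed by 'suffix (e.g. "file's").
TOKEN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def normalise(text):
    """Lowercase the text and split it into word tokens, numbers included."""
    return TOKEN.findall(text.lower())

words = normalise("The file's mode: set the MODE to 1.")
counts = Counter(words)

total = len(words)      # every word occurrence
distinct = len(counts)  # unique words after normalisation
print(total, distinct)
```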
20 most common words: the, to, is, a, of, and, in, for, this, be, if, or, that, file, with, are, by, 1, it, as.
620,437 instances of 'the', 225,990 of 'to', 54,996 of 'as'.
"Default" isn't the default word, but it's ranked 29 (37,150 instances) followed by "option" at 30 (36,755).
After about rank 25 it becomes fairly clear this is a technical corpus: set, value, options, string, system, perl, user, data (but user first!), linux, information, mode, format, source, directory....
I'm looking at 2grams, etc. Have to hand-copy terms to the tablet, so I'll stop live-blogging. Kind of fun though.
Top 2grams: of the, in the, to the. "this option" at #14 (13370) is the first, and 0-indexed, 1337 term.
(I am seriously not making this up.)
3grams even more obviously technical. "linux programmer's manual", "about reporting bugs".
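For reference, an ngram here is just a sliding window of n consecutive words. A minimal Python sketch, assuming whitespace-split, already-normalised words; the helper name ngrams and the sample sentence are made up for illustration.

```python
from collections import Counter

def ngrams(words, n):
    """Yield each run of n consecutive words as one space-joined string."""
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

words = "the value of the option is the value of the flag".split()
bigrams = Counter(ngrams(words, 2))
trigrams = Counter(ngrams(words, 3))

print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

A document of w words yields w - n + 1 ngrams, most of them unique, which is why the tables grow so quickly as n goes up.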
Trying to get a robust 3gram count, which is where awk and associative arrays bring this box to its knees. I may need to include only local-doc ngrams above some threshold....
From the bash manpage, something over 95% of 3grams are singletons.
Expanding that to 1000 manpages: 417,789 3grams, max in-doc freq: 95, mean: 1.173, 95%ile: 2
95 instances: "the value of" (from a subset).
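A sketch of how summary figures like these can be pulled from a list of ngram frequencies, in Python. The percentile definition behind the numbers above isn't stated; nearest-rank is assumed here, and the function name summarise and the sample data are invented.

```python
def summarise(freqs):
    """Summarise a list of ngram frequencies (one entry per distinct ngram)."""
    values = sorted(freqs)
    n = len(values)
    # Nearest-rank 95th percentile (assumed definition), in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "distinct": n,
        "max": values[-1],
        "mean": sum(values) / n,
        "p95": values[rank - 1],
        "singleton_share": sum(1 for v in values if v == 1) / n,
    }

# Invented data: 17 singletons, two ngrams seen twice, one seen five times.
stats = summarise([1] * 17 + [2, 2, 5])
print(stats)
```

With a long tail of singletons the mean sits barely above 1 while the max can be far higher, the same shape as the figures reported here.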
Killed the 3gram tally at 28 minutes, trying in-file count > 1.
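The in-file threshold idea, sketched in Python rather than awk: tally each document's 3grams locally and merge only those that repeat within that document. The names tally_repeated and min_count are hypothetical, and this is one reading of "in-file count > 1", not the actual script.

```python
from collections import Counter

def ngrams(words, n):
    """Yield each run of n consecutive words as one space-joined string."""
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def tally_repeated(docs, n=3, min_count=2):
    """Merge per-document ngram counts, keeping only ngrams seen
    at least min_count times within a single document."""
    total = Counter()
    for words in docs:
        local = Counter(ngrams(words, n))
        total.update({g: c for g, c in local.items() if c >= min_count})
    return total

docs = [
    "the value of x and the value of y".split(),
    "see the value of the option".split(),
]
total = tally_repeated(docs)
print(total)
```

The trade-off is that occurrences in documents where an ngram appears only once are dropped, so corpus-wide counts come out as undercounts; in exchange the global table no longer has to hold every singleton.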