@joeyh This beats out a contending phrase, "gnu general public license"
There are 10,557,717 words in the set.
137,435 of those are distinct across the corpus.
(Lowercased, other normalisation applied, may include numbers.)
20 most common words: the, to, is, a, of, and, in, for, this, be, if, or, that, file, with, are, by, 1, it, as.
620,437 instances of 'the', 225,990 of 'to', 54,996 of 'as'.
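For anyone wanting to reproduce counts like these, here is a minimal sketch. It is not the script used for the numbers above; the regex normalisation, the `tokenize` and `word_stats` names, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed normalisation: lowercase, then keep runs of letters,
    # digits, and apostrophes (so numbers like "1" count as words).
    return re.findall(r"[a-z0-9']+", text.lower())

def word_stats(text, top=20):
    # Returns (total words, distinct words, most common words with counts).
    words = tokenize(text)
    counts = Counter(words)
    return len(words), len(counts), counts.most_common(top)

# Hypothetical sample text standing in for the manpage corpus.
sample = "The option sets the default. If the file is missing, the default is used."
total, distinct, common = word_stats(sample, top=3)
print(total, distinct, common)
```

On a real corpus you would feed in the concatenated text of every page instead of `sample`.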
"Default" isn't the default word, but it's ranked 29 (37,150 instances) followed by "option" at 30 (36,755).
After about rank 25 it becomes fairly clear this is a technical corpus: set, value, options, string, system, perl, user, data (but user first!), linux, information, mode, format, source, directory...
I'm looking at 2grams, etc. Have to hand-copy terms to the tablet, so I'll stop live-blogging. Kind of fun though.
Top 2grams: of the, in the, to the. "this option" at #14 (13370) is the first, and 0-indexed, 1337 term.
(I am seriously not making this up.)
3grams even more obviously technical. "linux programmer's manual", "about reporting bugs"
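The 2gram and 3gram counts can be sketched the same way. This is an illustrative guess at the approach rather than the actual tooling; `ngrams`, `top_ngrams`, and the sample word list are hypothetical.

```python
from collections import Counter

def ngrams(words, n):
    # Slide a window of n consecutive words over the list.
    return zip(*(words[i:] for i in range(n)))

def top_ngrams(words, n, top=10):
    # Join each window into a phrase and count how often it occurs.
    return Counter(" ".join(g) for g in ngrams(words, n)).most_common(top)

# Hypothetical, already-normalised word list.
words = "see the linux programmer's manual or the linux programmer's manual".split()
print(top_ngrams(words, 2, top=3))
print(top_ngrams(words, 3, top=2))
```

Note this counts n-grams across the whole word list, so on a real corpus phrases can span sentence or page boundaries unless you split on those first.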
@dredmorbius mostly that I sometimes read a man page and am surprised to find that line in it
@joeyh It's in 45 manpages, so far. More than any other 4gram save "10 of the linux" and "1 general commands manual".
Beating the GPL is pretty impressive.