Dr. Edward Morbius ⭕ (mastodon.cloud)

500px will no longer allow photographers to license their photos under Creative Commons

"Photography platform 500px will no longer allow photographers to license their photos under a Creative Commons license, and is removing the functionality to search and download such images. The site also closed down its stock photo platform, 500px Marketplace yesterday, replacing it with distribution partnerships with Getty Images and Visual China Group...."

theverge.com/2018/7/1/17521456

Isabel Allende with Studs Terkel, September 9, 1979, on Chile and fascism.

Parallels to now.

s3.amazonaws.com/wfmt-studs-te

Expanding that to 1000 manpages: 417,789 3grams, max in-doc freq: 95, mean: 1.173, 95%ile: 2
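
Those stats fall straight out of a sorted count list. Minimal sketch, assuming a hypothetical counts.txt with one in-doc frequency per line:

    sort -n counts.txt | awk '
        { v[NR] = $1; s += $1 }
        END {
            # max, mean, and an approximate 95th percentile by index
            printf "max: %d  mean: %.3f  95%%ile: %d\n",
                   v[NR], s / NR, v[int(NR * 0.95)]
        }'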

95 instances: "the value of" (from a subset).

Killed the 3gram tally at 28 minutes, trying in-file count > 1.

Trying to get a robust 3gram count, which is where awk and associative arrays bring this box to its knees. I may need to include only local-doc ngrams above some threshold....
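
Roughly the shape of the tally, for the curious. A sketch only ('manpage.txt' is a stand-in, tokenisation is naive whitespace splitting), and the array is exactly what eats the RAM at corpus scale:

    awk '{ for (i = 1; i <= NF; i++) w[++n] = tolower($i) }
         END {
             # build every 3gram from the word stream, tally in an
             # associative array, print count-first for sorting
             for (i = 1; i <= n - 2; i++)
                 tally[w[i] " " w[i+1] " " w[i+2]]++
             for (g in tally) print tally[g], g
         }' manpage.txt | sort -rn | head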

From the bash manpage, something over 95% of 3grams are singletons.
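
That share is a one-liner over the tally output. Sketch, assuming a hypothetical 3gram-tally.txt with the count in field 1:

    awk '$1 == 1 { s++ } { t++ }
         END { printf "%.1f%% singletons\n", 100 * s / t }' 3gram-tally.txt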

I'm looking at 2grams, etc. Have to hand-copy terms to the tablet, so I'll stop live-blogging. Kind of fun though.

Top 2grams: of the, in the, to the. "this option" at #14 (13370) is the first, and, 0-indexed, 1337 term.

(I am seriously not making this up.)

3grams even more obviously technical: "linux programmer's manual", "about reporting bugs"

"Default" isn't the default word, butit's ranked 29 (37,150 instances) followed by "option" at 30 (36,755).

After about rank 25 it becomes fairly clear this is a technical corpus: set, value, options, string, system, perl, user, data (but user first!), linux, information, mode, format, source, directory....

There are 10,557,717 words in the set.

137,435 of those are distinct across the corpus.

(Lowercased, other normalisation applied, may include numbers.)

20 most common words: the, to, is, a, of, and, in, for, this, be, if, or, that, file, with, are, by, 1, it, as.

620,437 instances of 'the', 225,990 of 'to', 54,996 of 'as'.
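
Same shape as the ngram tally, one array keyed by word. Sketch, 'corpus/*.txt' assumed; a single array is fine at word scale (~137k distinct keys):

    awk '{ for (i = 1; i <= NF; i++) freq[tolower($i)]++ }
         END { for (w in freq) print freq[w], w }' corpus/*.txt |
        sort -rn | head -20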

The other major research finding is that lexing a file at a time via associative arrays is vastly faster than catting all 76MB of the corpus: 23 minutes, vs. a user-terminated process at 355 minutes, less than 50% complete.
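
The difference, schematically (paths hypothetical; shown for single words, but the win is biggest for ngrams, where corpus-wide key counts run far past per-file ones):

    # slow: one awk over the whole catted corpus; the associative
    # array holds every distinct key at once and the box swaps
    cat corpus/* |
        awk '{ for (i = 1; i <= NF; i++) tally[tolower($i)]++ }
             END { for (w in tally) print tally[w], w }'

    # faster: one small array per file, emitted and reset between
    # files, then a cheap merge over the short per-file tallies
    for f in corpus/*; do
        awk '{ for (i = 1; i <= NF; i++) tally[tolower($i)]++ }
             END { for (w in tally) print tally[w], w }' "$f"
    done |
        awk '{ sum[$2] += $1 } END { for (w in sum) print sum[w], w }'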

@joeyh This beats out a contending phrase, "gnu general public license"

Adventures in lexical analysis of a selected set[1] of Debian manpages, preliminary conclusions:

The third-most-prevalent 4gram is "1 author joey hess"

@joeyh, do you have anything to say for yourself?

Notes:
1. Roughly 9,000 of 13,480 manpages lexilysed.

In The Sinister Way, Richard von Glahn examines the emergence and evolution of the Wutong cult within the larger framework of the historical development of Chinese popular or vernacular religion—as opposed to institutional religions such as Buddhism or Daoism. Von Glahn's study, spanning three millennia, gives due recognition to the morally ambivalent and demonic aspects of divine power within the common Chinese religious culture.

2/end/

The Sinister Way
The Divine and the Demonic in Chinese Religious Culture

The most striking feature of Wutong, the preeminent God of Wealth in late imperial China, was the deity's diabolical character. Wutong was perceived not as a heroic figure or paragon of noble qualities but rather as an embodiment of humanity's basest vices, greed and lust, a maleficent demon who preyed on the weak and vulnerable.

ucpress.edu/book/9780520234086

1/

Anything but original:

duckduckgo.com/?q="50%27s+Shades+of+Grey"&atb=v64-5_a&t=cros&ia=web

I'm thinking of writing a mid-century S&M romance set in Eisenhower-era suburbia.

50's Shades of Grey

@aparrish I also took a different, entirely hackish approach using overall corpus freq & doc freq for the ngram, plus randomly inserted logs for controlling scale.

I need to make that saner, though it reveals useful info.

Random textual analysis Q as my awk hack stresses swap and disk controllers: are there any off-the-shelf, tuple-like or other semantic significance tools out there?

Effectively: within a corpus, identify terms (words, 2-3grams, probably) that are significant for distinguishing a specific member.

I'm cranking out word freqs and 2-5grams. Usefulness of the latter seems to fall off after 3 terms.

My weighted freq algo (homebrew) probably won't stand up to rigour; useful approaches sought.
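
tf-idf is the usual off-the-shelf answer to exactly this question: term freq within the doc, discounted by how many docs contain the term. A minimal awk sketch, assuming hypothetical input lines of 'doc term count', one line per doc/term pair:

    awk '{
        tf[$1 SUBSEP $2] = $3        # term count within one doc
        df[$2]++                     # number of docs containing the term
        docs[$1] = 1
    }
    END {
        N = 0; for (d in docs) N++   # corpus size in docs
        for (k in tf) {
            split(k, p, SUBSEP)
            # high score = frequent here, rare elsewhere
            print tf[k] * log(N / df[p[2]]), p[1], p[2]
        }
    }' term-counts.txt | sort -rn | head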

~1 million Creative Commons images on 500px.com are disappearing tomorrow!

If you have the resources please install ArchiveTeam's Warrior program and select the 500px project! archiveteam.org/index.php?titl

IRC is #500pieces on EFnet

thank you (boosts very appreciated)

If listicles are good enough for Euclid and Ludwig Wittgenstein, they're good enough for me