Inverting the Web 

We use search engines because the Web does not support accessing documents by anything other than URL. This puts a huge amount of control in the hands of the search engine company and those who control the DNS hierarchy.

Given that search engine companies can barely keep up with the constant barrage of attacks, commonly known as "SEO", intended to lower the quality of their results, a distributed inverted index seems like it would be impossible to build.

@freakazoid Shifting ground (and jumping back up this stack -- we've sorted the URL/URI bit):

What you suggest that's interesting to me is the notion of _self-description_ or _self-identity_ as an inherent document characteristic.

(Where a "document" is any fixed bag'o'bits: text, audio, image, video, data, code, binary, etc.)

Not metadata (name, path, URI).

*Maybe* a hash, though that's fragile.

What is _constant_ across formats?

@freakazoid So, for example:

I find a scanned-in book at the Internet Archive, I re-type the document myself (probably with typos) to create a Markdown source, and then generate PDF, ePub, and HTML formats.

What's the constant across these?

How could I, preferably programmatically, identify these as being the same, or at least, highly-related, documents?

MD5 / SHA-512 checksums will identify _files_, but not _relations between them_.
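A minimal sketch of that limitation, assuming Python's standard `hashlib`. The sample strings are made up for illustration: a single stray space is enough to give two "same" texts unrelated digests.

```python
import hashlib


def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of a byte string as hex."""
    return hashlib.sha512(data).hexdigest()


# Illustrative content only: the "retyped" copy has one extra space.
original = b"Call me Ishmael.\n"
retyped = b"Call me Ishmael. \n"

# Identical bytes always hash the same: the checksum identifies the file.
same_file = sha512_hex(original) == sha512_hex(original)

# Nearly identical bytes hash differently, and the digests say nothing
# about how close the two inputs were.
related_file = sha512_hex(original) == sha512_hex(retyped)

print(same_file, related_file)
```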

Can those relations be internalised intrinsically?

@freakazoid Or do you always have to maintain some external correspondence index which tells you that SOURCE.PDF was the basis for RETYPED.MD which then generated RETYPED.MD.ePub and RETYPED.MD.html, etc.?
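For comparison, the external-index approach might look something like this in Python. It is a sketch only: the dict layout and the helper names `root_of` and `related` are hypothetical, and the filenames are the ones from the example above.

```python
# External correspondence index: each derived file points at its source.
derived_from = {
    "RETYPED.MD": "SOURCE.PDF",
    "RETYPED.MD.ePub": "RETYPED.MD",
    "RETYPED.MD.html": "RETYPED.MD",
}


def root_of(name, index):
    """Follow 'derived from' links until reaching a file with no parent."""
    seen = set()
    while name in index:
        if name in seen:
            raise ValueError("cycle in correspondence index at " + name)
        seen.add(name)
        name = index[name]
    return name


def related(a, b, index):
    """Two files are related if they trace back to the same root."""
    return root_of(a, index) == root_of(b, index)


print(related("RETYPED.MD.ePub", "RETYPED.MD.html", derived_from))
```

The weakness is the one the question implies: the relation lives only in the index, so lose the index (or meet a copy nobody registered) and the files are strangers again.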

Something that will work across printed, re-typed, error/noise, whitespace variants. Maybe translations or worse.

Word vectors? A Makefile audit? Merkle trees, somehow?
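One cheap way to get a feel for a content-based answer is word shingles compared by Jaccard similarity. To be clear, that is a swapped-in technique, simpler than word vectors and not something anyone here has settled on. The sketch below assumes plain text has already been extracted from each format; the function names and the shingle size `k=3` are arbitrary choices.

```python
import re


def normalize(text):
    """Lowercase and keep only alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = normalize(text)
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Similarity in [0, 1]: shared shingles over all shingles."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


clean = "it was the best of times it was the worst of times"
typo = "it was the best of tmes it was the worst of times"
print(jaccard(clean, typo))
```

Whitespace, case, and punctuation variants score as identical, a typo lowers the score without zeroing it, and unrelated text scores near zero. It does nothing for translations, which would need something closer to the word-vector idea.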

@dredmorbius We have real world solutions for these problems in the form of notaries, court clerks, etc. I.e. (registered) witnesses. Trusted third parties, but they don't have to be a single party.

@freakazoid Right: authorities, certifiers, validators, auditors.

Some may verify _contents_, many only verify _process_. Some do detailed forensics.

The end result is a distributed web of trust over a fact or artefact being what it appears or claims to be. Which isn't always correct, but increases costs (and risks) of deception.

That will probably be at least a part of the system(s) I'm considering. There's some underlying need for either external authority or distributed consensus.

@dredmorbius I'd say just publish all the claims about the data, and let each person, node, organization, etc, decide which witnesses/publishers to trust. With tools to make that as easy as possible, of course.
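Roughly, in Python, and only as a sketch: the `Claim` fields, the `accepted` helper, and the witness names are all hypothetical, and a real system would sign the claims rather than take the witness field on faith.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    witness: str   # who is asserting this
    subject: str   # e.g. a content hash of one file
    relation: str  # e.g. "derived-from"
    target: str    # e.g. a content hash of another file


def accepted(claims, trusted):
    """Keep only the claims made by witnesses this node chooses to trust."""
    return [c for c in claims if c.witness in trusted]


# Everything is published; each reader filters with their own trust set.
published = [
    Claim("archive.example", "hash-of-md", "derived-from", "hash-of-pdf"),
    Claim("spammer.example", "hash-of-junk", "derived-from", "hash-of-pdf"),
]
my_trusted = {"archive.example"}

for claim in accepted(published, my_trusted):
    print(claim.witness, claim.subject, claim.relation, claim.target)
```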

@freakazoid The problem with "just decide who to trust" is that it becomes combinatorially expensive quickly.

Back to the URL/URI issue, and looking at DNS again, there's the notion of a DNS search list -- a set of domains (or subdomains) searched preferentially for an unqualified hostname.

That's a useful though somewhat inflexible approach to "how do I assign nicknames to resources frequently used?"
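The search-list idea, sketched in Python. This is an illustration rather than how any real resolver is implemented: actual resolvers have extra rules (the `ndots` option in resolv.conf, for one), and `resolve_nickname`, the domains, and the host set here are invented.

```python
def resolve_nickname(name, search_list, known_hosts):
    """Expand a short name by trying each search domain in order."""
    # A trailing dot marks the name as already fully qualified.
    if name.endswith("."):
        fqdn = name.rstrip(".")
        return fqdn if fqdn in known_hosts else None
    # Otherwise the first search domain that yields a known host wins.
    for domain in search_list:
        candidate = name + "." + domain
        if candidate in known_hosts:
            return candidate
    # Fall back to the name exactly as given.
    return name if name in known_hosts else None


known = {"wiki.lab.example.org", "wiki.example.org", "mail.example.org"}
search = ["lab.example.org", "example.org"]

print(resolve_nickname("wiki", search, known))
print(resolve_nickname("mail", search, known))
```

The inflexibility shows up right away: "wiki" always means the lab's wiki because of list order, with no way to say "the other one" short of spelling out the full name.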

We don't address people formally by full names + patronyms + SSN. We say "Hey, Sean".

@freakazoid And ... there are levels of locality _and_ generality in trust.

If waifu tells me "lamp is broken, switch is burnt out", we have a close relationship, and I'm inclined to believe her.

But when I get to the lamp store the tech says "no, that's wired in series, there's a bad bulb". I give more credence to the tech's knowledge _even if they've not inspected the lamp_ than waifu.

Trust is complex and contextual.

