Inverting the Web
We use search engines because the Web does not support accessing documents by anything other than URL. This puts a huge amount of control in the hands of the search engine company and those who control the DNS hierarchy.
Given that search engine companies can barely keep up with the constant barrage of attacks, commonly known as "SEO", intended to lower the quality of their results, a distributed inverted index seems like it would be impossible to build.
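For readers unfamiliar with the term, here is a minimal sketch of what an inverted index is: a map from words to the documents containing them, which is what lets you reach a document by content rather than by URL. The URLs, the sample texts, and the function names `build_inverted_index` and `search` are all made up for illustration; nothing here addresses the distribution or spam-resistance problems raised above.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query (plain AND search)."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so intersections never mutate the index.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical documents keyed by made-up URLs.
docs = {
    "https://example.org/a": "distributed search on the web",
    "https://example.org/b": "search engines and spam",
}
index = build_inverted_index(docs)
```

Calling `search(index, "distributed search")` intersects the posting sets for both words, so only documents containing both come back.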
@freakazoid Shifting ground (and jumping back up this stack -- we've sorted the URL/URI bit):
What you suggest that's interesting to me is the notion of _self-description_ or _self-identity_ as an inherent document characteristic.
(Where a "document" is any fixed bag'o'bits: text, audio, image, video, data, code, binary, etc.)
Not metadata (name, path, URI).
*Maybe* a hash, though that's fragile.
What is _constant_ across formats?
@freakazoid So, for example:
I find a scanned-in book at the Internet Archive, I re-type the document myself (probably with typos) to create a Markdown source, and then generate PDF, ePub, and HTML formats.
What's the constant across these?
How could I, preferably programmatically, identify these as being the same, or at least highly related, documents?
MD5 / SHA-512 checksums will identify _files_, but not _relations between them_.
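A quick sketch of why a checksum alone doesn't help here, using Python's standard `hashlib`. The sample sentences and the helper name `sha512_hex` are illustrative assumptions: a single dropped letter in a re-typed copy produces a completely unrelated digest, so the hash says "different file" and nothing about how close the two are.

```python
import hashlib


def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of the given bytes as a hex string."""
    return hashlib.sha512(data).hexdigest()


# Illustrative sample text; the second copy drops one letter at the end.
original = b"It was the best of times, it was the worst of times."
retyped = b"It was the best of times, it was the worst of time."

# Identical bytes always give identical digests...
same_file_matches = sha512_hex(original) == sha512_hex(original)
# ...but a near-identical copy gives a digest that shares nothing useful.
near_copy_matches = sha512_hex(original) == sha512_hex(retyped)
```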
Can those relations be internalised intrinsically?
@freakazoid Or do you always have to maintain some external correspondence index which tells you that SOURCE.PDF was the basis for RETYPED.MD, which then generated RETYPED.MD.ePub and RETYPED.MD.html, etc.?
Something that will work across printed, re-typed, error/noise, whitespace variants. Maybe translations or worse.
Word vectors? A Makefile audit? Merkle trees, somehow?
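One possible starting point, offered as a sketch rather than an answer: compare documents by their overlapping word sequences ("shingles") and score the overlap with Jaccard similarity. This is a swapped-in technique, not any of the candidates floated above, and it won't survive translation. The function names, the shingle size `k=3`, and the normalisation rule (lowercase, keep only ASCII letters and digits) are all assumptions.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles, ignoring case, punctuation and whitespace."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Too short for a full shingle: treat the whole text as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(text_a, text_b, k=3):
    """Score two texts between 0.0 (no shared shingles) and 1.0 (same shingle set)."""
    return jaccard(shingles(text_a, k), shingles(text_b, k))
```

Whitespace, case, and punctuation variants score as identical, a re-typed copy with a typo scores lower but well above an unrelated text, and the score degrades gradually as errors accumulate. What it can't tell you is the direction of derivation, i.e. which file was the source.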
@dredmorbius We have real world solutions for these problems in the form of notaries, court clerks, etc. I.e. (registered) witnesses. Trusted third parties, but they don't have to be a single party.
@dredmorbius In the RDF world I guess one doesn't sign the individual triple but the entire graph.
And it might make more sense to call these 4-tuples, because it's really "this person says that this object is related in this way to this other object".
@freakazoid So for 4-tuple:
4. Object1-Object2 relation
"Signed" means that the whole statement is then cryptographically signed, making it an authenticatable statement?
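A rough sketch of what "signing the whole statement" could look like, under loud assumptions. The field names (`asserter`, `subject`, `relation`, `obj`) are one reading of "this person says that this object is related in this way to this other object", and the relation name `retyped-from` is invented. The signature here is an HMAC from Python's standard library, which is a shared-secret stand-in: a real witness scheme would use a public-key signature so that anyone can verify without holding the signing key.

```python
import hashlib
import hmac
import json
from typing import NamedTuple


class Statement(NamedTuple):
    asserter: str  # who is making the claim
    subject: str   # the first object
    relation: str  # how the subject relates to obj
    obj: str       # the second object


def canonical_bytes(stmt: Statement) -> bytes:
    """Serialise deterministically so the same statement always yields the same bytes."""
    return json.dumps(stmt._asdict(), sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(stmt: Statement, key: bytes) -> str:
    """HMAC-SHA256 over the whole statement (stand-in for a public-key signature)."""
    return hmac.new(key, canonical_bytes(stmt), hashlib.sha256).hexdigest()


def verify(stmt: Statement, signature: str, key: bytes) -> bool:
    """Check the signature in constant time; any changed field makes this fail."""
    return hmac.compare_digest(sign(stmt, key), signature)


# Hypothetical usage: the key and the relation name are made up.
key = b"demo-key-not-a-real-secret"
stmt = Statement("dredmorbius", "RETYPED.MD", "retyped-from", "SOURCE.PDF")
signature = sign(stmt, key)
```

Because the signature covers all four fields together, changing the asserter, either object, or the relation invalidates it, which is what makes the statement authenticatable as a unit.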
@freakazoid And, so:
Back to search and Web:
- The actual URL and path matter to the browser.
- They may matter to me. Some RoboSpam site ripping off my blog posts _might_ leave the content unchanged, but they're still scamming web traffic, ad revenue, or reputation, based on false pretences. I want to read my content from my blog, not SpamSite, even if text and hashes match.
Ad revenue is basically a way to use the web's (accidental) dynamism as a monetization strategy. If monetization were based on permission to access, you'd save on hosting costs if you *only* gave permission & whoever happened to be around did the hosting (like serving password-protected items off BitTorrent and selling the passwords).
An ISP startup I worked for back in '96 (InterNex, later acquired by Concentric, which renamed itself to XO Communications using one of InterNex's domains for customers) tried to make something like this. It was essentially DRM for arbitrary content that used a .exe wrapper that contacted a license server. I don't think they ever managed to even bring it to market.
@freakazoid Yes, this.
Another Brilliant Idea I had, to promptly discover far more able minds had arrived at it long before.
INFORMATION IS A PUBLIC GOOD. PROVIDE IT AS SUCH.
Finance it on tax-supported UBI, awards, grants, and bonuses, with supplemental income from performance and unit sales where appropriate.