Inverting the Web 

We use search engines because the Web does not support accessing documents by anything other than URL. This puts a huge amount of control in the hands of the search engine company and those who control the DNS hierarchy.

Given that search engine companies can barely keep up with the constant barrage of attacks, commonly known as "SEO", intended to lower the quality of their results, a distributed inverted index seems like it would be impossible to build.

@freakazoid Shifting ground (and jumping back up this stack -- we've sorted the URL/URI bit):

What you suggest that's interesting to me is the notion of _self-description_ or _self-identity_ as an inherent document characteristic.

(Where a "document" is any fixed bag'o'bits: text, audio, image, video, data, code, binary, etc.)

Not metadata (name, path, URI).

*Maybe* a hash, though that's fragile.

What is _constant_ across formats?

@freakazoid So, for example:

I find a scanned-in book at the Internet Archive, I re-type the document myself (probably with typos) to create a Markdown source, and then generate PDF, ePub, and HTML formats.

What's the constant across these?

How could I, preferably programmatically, identify these as being the same, or at least, highly-related, documents?

MD5 / SHA-512 checksums will identify _files_, but not _relations between them_.
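
A minimal sketch of that point, using only Python's standard `hashlib`. The two sample strings are made up for illustration; the second differs from the first by a single extra space, as a re-typed copy might.

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    """Hex SHA-512 digest of a byte string."""
    return hashlib.sha512(data).hexdigest()

# Hypothetical sample texts: the "retyped" one has one extra space.
original = b"It was the best of times, it was the worst of times."
retyped = b"It was the best of times,  it was the worst of times."

# Identical bytes give identical digests.
print(sha512_hex(original) == sha512_hex(bytes(original)))
# A one-byte difference gives an unrelated digest, so the hash alone
# says nothing about how closely the two documents are related.
print(sha512_hex(original) == sha512_hex(retyped))
```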

Can those relations be internalised intrinsically?

@freakazoid Or do you always have to maintain some external correspondence index which tells you that SOURCE.PDF was the basis for RETYPED.MD which then generated RETYPED.MD.ePub and RETYPED.MD.html, etc.

Something that will work across printed, re-typed, error/noise, whitespace variants. Maybe translations or worse.

Word vectors? A Makefile audit? Merkle trees, somehow?

@freakazoid Some sort of edit-distance calculation?

One that would be aware of things like document processors which ingest Markdown text and emit PDF, HTML, ePub, DOCX, etc., etc.?

@dredmorbius We have real world solutions for these problems in the form of notaries, court clerks, etc. I.e. (registered) witnesses. Trusted third parties, but they don't have to be a single party.

@dredmorbius In the RDF world I guess one doesn't sign the individual triple but the entire graph.

And it might make more sense to call these 4-tuples, because it's really "this person says that this object is related in this way to this other object".

@freakazoid Sorry, what's a triple in this context?

I've run across ... N-triples in an RDF / metadata context (via Worldcat -- it's one of their record schemas).

@dredmorbius Sorry, I thought you had used the term triple, but you actually used the term relation. I'm talking about triples in the RDF sense, which are relations.

@freakazoid So for 4-tuple:

1. Verifier
2. Object1
3. Object2
4. Object1-Object2 relation

"Signed" means that the whole statement is then cryptographically signed, making it an authenticatable statement?

@freakazoid Got it.

So in RDF: Subject - (Predicate) -> Object

"X relates to Y as Z".

As a 4-tuple:

"A _says_ that X relates to Y as Z".

Hash & sign, etc., etc.
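
One way the "A says that X relates to Y as Z" statement might look in code, as a rough sketch only. The `Claim` type, the `derivedFrom` predicate and the key are all hypothetical. HMAC-SHA256 is used here purely as a standard-library stand-in for signing: it needs a shared secret, so a real system would presumably use a public-key signature instead, so that anyone can verify without being able to forge.

```python
import hashlib
import hmac
import json
from typing import NamedTuple

class Claim(NamedTuple):
    asserter: str   # A: who is making the statement
    subject: str    # X
    predicate: str  # Z: how X relates to Y
    obj: str        # Y

def canonical_bytes(claim: Claim) -> bytes:
    # Serialise deterministically so the same claim always yields the same bytes.
    return json.dumps(claim._asdict(), sort_keys=True,
                      separators=(",", ":")).encode("utf-8")

def sign(claim: Claim, key: bytes) -> str:
    # ASSUMPTION: HMAC stands in for a real public-key signature scheme.
    return hmac.new(key, canonical_bytes(claim), hashlib.sha256).hexdigest()

def verify(claim: Claim, key: bytes, signature: str) -> bool:
    # Constant-time comparison of the recomputed and supplied signatures.
    return hmac.compare_digest(sign(claim, key), signature)

key = b"hypothetical-shared-secret"
claim = Claim("alice", "RETYPED.MD", "derivedFrom", "SOURCE.PDF")
sig = sign(claim, key)
print(verify(claim, key, sig))
```

Changing any of the four fields after signing makes `verify` fail, which is what turns the tuple into an authenticatable statement.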

@freakazoid And, so:

Back to search and Web:

- The actual URL and path matter to the browser.

- They may matter to me. Some RoboSpam site ripping off my blog posts _might_ leave the content unchanged, but they're still scamming web traffic, ads revenue, or reputation, based on false pretences. I want to read my content from my blog, not SpamSite, even if text and hashes match.

@freakazoid The URL and domain connote _trust_ and a set of relationships that's not front-of-mind to the user, but _still matters_.

Content search alone fails to provide this. And some proxy for "who is providing this" -- who is the _authority_ represented as creator, editor, publisher, curator, etc. -- is what we're looking for. DNS and host-part of URL ... somewhat answer this.

(Also TLS certs, etc.)

@freakazoid In a physical library (or bookshop, etc.) there's a trust relation set by the physical location and structure, the administration and librarian(s), acquisitions, publishers, authors.

So Some Book listed within the catalogue is a Fair Representation of the canonical work.

@freakazoid I think, by the way, that this in part answers my question: is self-description possible.

No, it's not. _Some_ level of metadata (even if provided within the work itself) is necessary.

@dredmorbius FWIW word and phrase presence/frequency is self-description, in that it is verifiable without consulting a human. It's also useful for search, though it's generally not what humans care about directly even though it's what they search on; what they care about is the actual idea or thing they think documents having those words or phrases might be about.

@freakazoid Right.

I need to check on what state-of-the-art is, but based on tuples or ngrams of even short word sets (2-3, maybe 4), you can create an extensive signature of a text by sampling within it. You can transform those to be constant against various modulations (e.g., ASCII7 vs. Unicode, whitespace, punctuation, ligatures, even common spelling variants/errors).

And then check an offered text against a known signature on a sampling of tuples through the doc.

This undoubtedly exists.
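
A rough sketch of the idea, not a claim about what existing tools do. Text is normalised (Unicode compatibility decomposition, ASCII folding, lowercasing, punctuation and whitespace dropped), cut into 3-word tuples, and two texts are compared by the overlap of their tuple sets (Jaccard similarity). Function names are made up for this example, and spelling variants are not handled.

```python
import re
import unicodedata

def normalise(text: str) -> list[str]:
    # NFKD splits ligatures and accented letters into base characters plus marks;
    # encoding to ASCII with "ignore" then drops the marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Keep only runs of letters and digits: punctuation and whitespace vanish.
    return re.findall(r"[a-z0-9]+", text)

def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    # Every run of n consecutive normalised words.
    words = normalise(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    # Shared tuples divided by all tuples seen in either text.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

scanned = "The ﬁrst  edition,\nprinted in Café Royal."
retyped = "the first edition printed in cafe royal"
print(jaccard(shingles(scanned), shingles(retyped)))
```

Checking only a sample of the tuples, rather than all of them, would be a cheaper variant of the same comparison.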

@dredmorbius They use techniques like this to detect plagiarism. You can compute something like a Bloom filter for a document and then use Hamming distance to compare. That can work well as long as one is not intentionally trying to defeat it.

Of course, that assumes raw text. Once you get into complex markup, the markup can change the meaning of the document without changing what a text extractor will see. And then there's higher-bandwidth media like images, audio, and video.
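
A small sketch of the Bloom-filter-plus-Hamming-distance comparison, under stated assumptions: the 1024-bit width, two hash functions, word trigrams and the function names are arbitrary choices for illustration, not what any particular plagiarism detector does.

```python
import hashlib

BITS = 1024  # signature width; an arbitrary choice for this sketch

def bloom_signature(text: str, n: int = 3, hashes: int = 2) -> int:
    """Set a few bits per word n-gram in a fixed-width bit field (held as an int)."""
    words = text.lower().split()
    sig = 0
    for i in range(len(words) - n + 1):
        gram = " ".join(words[i:i + n]).encode("utf-8")
        for seed in range(hashes):
            # Prefixing a seed byte gives several independent hash functions.
            digest = hashlib.sha256(bytes([seed]) + gram).digest()
            sig |= 1 << (int.from_bytes(digest[:8], "big") % BITS)
    return sig

def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two signatures differ."""
    return bin(a ^ b).count("1")

s1 = bloom_signature("the quick brown fox jumps over the lazy dog")
s2 = bloom_signature("the quick brown fox jumps over the lazy cat")
s3 = bloom_signature("an entirely unrelated sentence about inverted indexes and trust")
print(hamming(s1, s2), hamming(s1, s3))
```

Near-duplicates share most of their n-grams, so most of their set bits coincide and the distance stays small; unrelated texts set mostly different bits.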

@freakazoid And for anyone following this:

I'm not an expert, though I'm interested in the area.

I feel like I'm staggering drunk in the dark. Some of what I'm describing is Things I Have Known for Five Minutes Longer Than You (or a few days). Some longer.

This is ... remote from most work I've done, though I've been kicking around ideas for a few years, and know at least _some_ of what I'm talking about.

Informed input / corrections welcomed.

@dredmorbius @freakazoid I'll state publicly that I appreciate the thinking going on in this thread!

I don't have any additional input to give.

@dredmorbius Regarding the ripping off of content, URLs only help with that to the extent that people pay attention to them, which they don't, even when typing in passwords and other secret information like credit card numbers.

@freakazoid What's the meatspace fix to this?

It mostly comes down to physical location. Though slipping something into the postal mail (or out, or phone calls, etc.) is an attack vector.

Are we simply outsourcing trust to search engines?

@dredmorbius @freakazoid
Ad revenue is basically a way to use the web's (accidental) dynamicism as a monetization strategy. If monetization were based on permission to access, you'd save on hosting costs if you *only* gave permission & whoever happened to be around did the hosting (like serving password-protected items off bittorrent and selling the passwords).

@enkiv2 @dredmorbius Of course, now instead of pirating big files people will just pirate the passwords ;-)

An ISP startup I worked for back in '96 (InterNex, later acquired by Concentric which renamed itself to XO Communications using one of InterNex's domains for customers) tried to make something like this. It was essentially DRM for arbitrary content that used a .exe wrapper that contacted a license server. I don't think they ever managed to even bring it to market.

@enkiv2 @dredmorbius Perhaps a better approach would be to separate funding and access entirely, the way it was for thousands of years?

@freakazoid Yes, this.

Another Brilliant Idea I had, to promptly discover far more able minds had arrived at it long before.


Finance it on tax-supported UBI, awards, grants, and bonuses, with supplemental income from performance and unit sales where appropriate.


@freakazoid @enkiv2 @dredmorbius
Right, I'm imagining a world without piracy. (It turns out that if it's easier to pay, the first world will generally just pay, and piracy becomes limited to folks who wouldn't pay anyway.) What I'm describing is xanadu 'transcopyright' though -- but transcopyright in xusp, xsp, oxu, & xuc is based on one time pads for subdivision reasons so it doesn't save you any bytes.

@enkiv2 @freakazoid @dredmorbius
It saves you someeee because links and formatting aren't encrypted and also because (since documents are static) nobody's re-fetching. And also since new versions transclude from the old they wouldn't need to fetch twice for an update (but also would only pay for updated characters...)

@enkiv2 @dredmorbius If it's easy to pay people will pay, but then there's also a strong encouragement to put stuff that would otherwise have been free behind paywalls, like we see in app stores. I don't think "no piracy" is the goal we should be looking for. It's maximum value for humanity from creativity.

@enkiv2 @dredmorbius Or to put it another way my goal is not to make sure that people pay to consume content but to make it so that people can make awesome stuff. A fixed payment per person or per use is about the crudest way I can think of to accomplish that. If anything it dramatically limits the utility of creativity, because even though it's nearly costless for additional people to benefit from it, unless they can or will pay the fixed price, they get nothing.

@enkiv2 @dredmorbius Likewise, there's a barrier to paying *more*. Especially since payment happens up front, before the payer has any idea what utility they will derive from the content. Far better to pay after the fact on a sliding scale. Sure, some will exploit that, and I think our aversion for that is what makes us accept such a shitty solution to begin with. But I think creators would get far more with such a model, especially since it helps eliminate middlemen.

@enkiv2 @dredmorbius We know making it easy to pay reduces piracy, but we have never tried making it easy to pay after the fact, and especially not without middlemen. The current easy payment systems (Netflix, Spotify, etc) have huge inefficiencies even ignoring the "one price fits all" problem. It also leaves niche interests under- or fully un-served.

@enkiv2 @dredmorbius I also think a Xanadu-like system would break down quickly without people with guns to enforce it. It's too much complexity for too little gain. Effort to police violations would almost certainly exceed the amount of value for the vast majority of works. Just like it does when people steal small creators' videos on YouTube. So you'd have a system that at best would only benefit large content publishers. No thanks.

@freakazoid @enkiv2 @dredmorbius
Yeah, transcopyright relies on the existing (government-enforced) copyright & licensing mechanisms. It's a hack on top of that to streamline shit, just like the GPL. It's, in my view, the least shitty one beyond abolishing copyright entirely.

@freakazoid @enkiv2 @dredmorbius elementary OS is exploring good things this way in their AppCenter!

They heavily encourage, but do not require, payment & they provide a button for people to pay whenever they want.

@freakazoid @enkiv2 @dredmorbius interesting thread. If resources had a 'Suggested price' and consumption means 'Intention to pay' then afterwards payment could be below price w. e.g. max. 50% off (disappointed), on par or above price (cool stuff). Average payment then indicates 'Quality of resource': "N people paid X price". Consistently underpaying affects Reputation, risks losing access to resources.

@alcinnz @freakazoid @enkiv2 @dredmorbius
The minimum price, set as a percentage of suggested price, ensures creators at least earn something. The percentage could start low and dynamically increase (or decrease further) based on the value assigned by the consumers through their payment.
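
The arithmetic of that proposal could be sketched roughly like this. The function names, the 50% floor and the sample amounts are illustrative assumptions, and the dynamic adjustment of the percentage is left out.

```python
from statistics import mean

def settle(offered: float, suggested: float, min_fraction: float = 0.5) -> float:
    """Amount actually charged: the payer's offer, but never below the floor."""
    return max(offered, suggested * min_fraction)

def quality_ratio(payments: list[float], suggested: float) -> float:
    """Average payment relative to the suggested price (1.0 means 'on par')."""
    return mean(payments) / suggested

suggested = 10.0
# Three hypothetical payers: disappointed, satisfied, delighted.
payments = [settle(p, suggested) for p in (2.0, 10.0, 15.0)]
print(payments, quality_ratio(payments, suggested))
```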

@humanetech @alcinnz @enkiv2 @dredmorbius I think the way to make sure creators earn something is to have something like a UBI, or otherwise make it so that one doesn't have to earn anything to live a dignified, healthy, happy, productive life. I think the only minimum price that makes sense is zero, because a) that's the cost of an additional copy, and b) there are a huge number of people who will benefit from a work who can't pay.

@humanetech @alcinnz @enkiv2 @dredmorbius Remember, the value to humanity of a creative work comes from its consumption. The value to any given individual is the difference between the value to them of consuming the work and the value of what they have to pay for it. The reason we enable creators to capture some of the value they produce is to incentivize them to create more. But we want them to create works with the most value to others.

@humanetech @alcinnz @enkiv2 @dredmorbius We also want people to create derivative works, and we want works that encourage derivation, because that multiplies their value. Excessive financial incentives derived from limiting access tend to reduce the amount of derivation by others, and it can cause excessive derivation by the creator who owns the original work in an effort to extract maximum value with minimum effort.

@humanetech @alcinnz @enkiv2 @dredmorbius So I think the optimal scenario is to create a culture of paying for creative works not based on the value one individually derives from them, but the value one feels humanity derives from them. And of paying at whatever point in time they can, not just right before or right after consuming it. It can be like tithing to the church, where the church is a global decentralized Patreon.

@freakazoid @humanetech @alcinnz @enkiv2 @dredmorbius

Or like a tax.

Which could also go towards other commonly-beneficial endeavors like research... and schooling... and ensuring the general welfare.

Imagine a government that supported all these things.

@freakazoid @humanetech @alcinnz @enkiv2 @dredmorbius

A substantial part of our creative output is made with the express purpose of influencing others.

And, like this thread, not paid for by the reader.

If I may paraphrase you:
The value to any given individual then is the difference between the value to them of getting others to consume the work and the value of what they have to pay for it.

@Jens @humanetech @dredmorbius @enkiv2 @alcinnz That's a really good point. I hadn't really been thinking about the value to the creator themselves of having others use their work. It especially gives an interesting perspective on Hollywood and the media since they have a lot of influence on our culture and politics directly through the works they produce.

@freakazoid @enkiv2 @dredmorbius
Well, the XU transcopyright model isn't globally fixed & there was the assumption that only a relatively small amount of content would actually be paywalled, but you're right that when an effective paywall exists there's incentive to put more behind it. The point was to set up distribution in such a way that less user-friendly DRM measures & stuff like individual takedowns couldn't be justified as easily.

@freakazoid Right: authorities, certifiers, validators, auditors.

Some may verify _contents_, many only verify _process_. Some do detailed forensics.

The end result is a distributed web of trust over a fact or artefact being what it appears or claims to be. Which isn't always correct, but increases costs (and risks) of deception.

That will probably be at least a part of the system(s) I'm considering. There's some underlying need for either external authority or distributed consensus.

@dredmorbius I'd say just publish all the claims about the data, and let each person, node, organization, etc, decide which witnesses/publishers to trust. With tools to make that as easy as possible, of course.
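
As a sketch of what "publish all the claims, let each node pick its witnesses" might mean mechanically: every claim carries its witness, and each reader filters by their own trust set. The `Claim` fields, the relation names and the `.example` hosts are hypothetical.

```python
from typing import NamedTuple

class Claim(NamedTuple):
    witness: str   # who published the claim
    subject: str
    relation: str
    obj: str

def accepted(claims: list[Claim], trusted: set[str]) -> list[Claim]:
    """Keep only the claims made by witnesses this particular reader trusts."""
    return [c for c in claims if c.witness in trusted]

# All claims are published; nothing is filtered at the source.
claims = [
    Claim("library.example", "RETYPED.MD", "derived-from", "SOURCE.PDF"),
    Claim("spamsite.example", "COPY.HTML", "original-of", "RETYPED.MD"),
]

# Each person, node or organisation keeps its own trust set.
my_trusted = {"library.example"}
print(accepted(claims, my_trusted))
```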

@freakazoid The problem with "just decide who to trust" is that it becomes combinatorially expensive quickly.

Back to the URL/URI issue, and looking at DNS again, there's the notion of a DNS search list -- a set of domains (or subdomains) searched preferentially for an unqualified hostname.

That's a useful though somewhat inflexible approach to "how do I assign nicknames to resources frequently used?"

We don't address people formally by full names + patronyms + SSN. We say "Hey, Sean".
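
A toy illustration of the search-list idea, assuming an in-memory table of known names rather than real DNS queries. The hostnames and addresses are made up, and treating any name containing a dot as fully qualified is a simplification of what real resolvers do.

```python
from typing import Optional

def resolve(name: str, search_list: list[str], known: dict[str, str]) -> Optional[str]:
    """Expand a short nickname by trying each search domain in order.

    Returns the first fully qualified name found in `known`, or None.
    """
    if "." in name:
        # Simplification: a dotted name is taken as already qualified.
        return name if name in known else None
    for domain in search_list:
        candidate = f"{name}.{domain}"
        if candidate in known:
            return candidate
    return None

# Hypothetical name table standing in for the DNS.
known = {
    "wiki.lab.example.org": "192.0.2.10",
    "wiki.example.com": "192.0.2.20",
}
search = ["lab.example.org", "example.com"]
print(resolve("wiki", search, known))
```

The order of the search list is what makes "wiki" mean the nearby one first, which is also where the inflexibility comes from: the preference is fixed per host, not per context.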

@freakazoid And ... there are levels of locality _and_ generality in trust.

If waifu tells me "lamp is broken, switch is burnt out", we have a close relationship, and I'm inclined to believe her.

But when I get to the lamp store the tech says "no, that's wired in series, there's a bad bulb". I give more credence to the tech's knowledge _even if they've not inspected the lamp_ than waifu.

Trust is complex and contextual.
