Inverting the Web 

@freakazoid Shifting ground (and jumping back up this stack -- we've sorted the URL/URI bit):

What you suggest that's interesting to me is the notion of _self-description_ or _self-identity_ as an inherent document characteristic.

(Where a "document" is any fixed bag'o'bits: text, audio, image, video, data, code, binary, etc.)

Not metadata (name, path, URI).

*Maybe* a hash, though that's fragile.

What is _constant_ across formats?

@freakazoid So, for example:

I find a scanned-in book at the Internet Archive, I re-type the document myself (probably with typos) to create a Markdown source, and then generate PDF, ePub, and HTML formats.

What's the constant across these?

How could I, preferably programmatically, identify these as being the same, or at least, highly-related, documents?

MD5 / SHA-512 checksums will identify _files_, but not _relations between them_.

Can those relations be internalised intrinsically?
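(A concrete illustration of why a plain hash is fragile: two renderings of the "same" text, differing only in whitespace, produce completely unrelated digests. A minimal sketch in Python with made-up sample strings:)

```python
import hashlib

# Two renderings of the "same" passage, differing only in whitespace:
a = "It was the best of times, it was the worst of times."
b = "It was the best of times,  it was the worst of times.\n"

digest_a = hashlib.sha256(a.encode()).hexdigest()
digest_b = hashlib.sha256(b.encode()).hexdigest()

# A one-byte change flips roughly half the output bits, so the digests
# carry no information about how *similar* the two inputs are.
print(digest_a == digest_b)
```

So checksums can assert "this exact file", but say nothing about "a near-variant of this file".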

@freakazoid Or do you always have to maintain some external correspondence index which tells you that SOURCE.PDF was the basis for RETYPED.MD, which then generated RETYPED.MD.ePub and RETYPED.MD.html, and so on?

Something that will work across printed, re-typed, error/noise, whitespace variants. Maybe translations or worse.

Word vectors? A Makefile audit? Merkle trees, somehow?

@dredmorbius We have real world solutions for these problems in the form of notaries, court clerks, etc. I.e. (registered) witnesses. Trusted third parties, but they don't have to be a single party.

@dredmorbius In the RDF world I guess one doesn't sign the individual triple but the entire graph.

And it might make more sense to call these 4-tuples, because it's really "this person says that this object is related in this way to this other object".

@freakazoid So for 4-tuple:

1. Verifier
2. Object1
3. Object2
4. Object1-Object2 relation

"Signed" means that the whole statement is then cryptographically signed, making it an authenticatable statement?
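(Roughly, yes: serialise the whole 4-tuple canonically and sign that serialisation, so any change to any element invalidates the signature. A sketch in Python; the names are made up, and stdlib HMAC with a shared key stands in here for a real public-key signature such as Ed25519:)

```python
import hashlib
import hmac
import json

def sign_statement(key: bytes, verifier: str, obj1: str, obj2: str, relation: str) -> str:
    # Canonical serialisation: sorted keys, fixed separators, so the same
    # 4-tuple always produces the same bytes -- the signature covers the
    # entire statement, not the individual elements.
    payload = json.dumps(
        {"verifier": verifier, "object1": obj1, "object2": obj2, "relation": relation},
        sort_keys=True, separators=(",", ":"),
    ).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

key = b"demo-key"  # in practice the verifier's private signing key
sig = sign_statement(key, "alice", "SOURCE.PDF", "RETYPED.MD", "was-retyped-as")
```

Anyone holding the corresponding verification key can then check that the statement is authentic and untampered.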

@freakazoid And, so:

Back to search and Web:

- The actual URL and path matter to the browser.

- They may matter to me. Some RoboSpam site ripping off my blog posts _might_ leave the content unchanged, but they're still scamming web traffic, ads revenue, or reputation, based on false pretences. I want to read my content from my blog, not SpamSite, even if text and hashes match.


@freakazoid The URL and domain connote _trust_ and a set of relationships that's not front-of-mind for the user, but _still matters_.

Content search alone fails to provide this. Some proxy for "who is providing this" -- who is the _authority_ represented as creator, editor, publisher, curator, etc. -- is what we're looking for. DNS and the host part of the URL ... somewhat answer this.

(Also TLS certs, etc.)

@freakazoid In a physical library (or bookshop, etc.) there's a trust relation set by the physical location and structure, the administration and librarian(s), acquisitions, publishers, authors.

So Some Book listed within the catalogue is a Fair Representation of the canonical work.

@freakazoid I think, by the way, that this in part answers my question: is self-description possible.

No, it's not. _Some_ level of metadata (even if provided within the work itself) is necessary.

@dredmorbius FWIW word and phrase presence/frequency is self-description, in that it is verifiable without consulting a human. It's also useful for search, though it's generally not what humans care about directly even though it's what they search on; what they care about is the actual idea or thing they think documents having those words or phrases might be about.

@freakazoid Right.

I need to check on what the state of the art is, but based on tuples or ngrams of even short word sets (2-3, maybe 4), you can create an extensive signature of a text by sampling within it. You can transform those to be constant against various modulations (e.g., ASCII7 vs. Unicode, whitespace, punctuation, ligatures, even common spelling variants/errors).

And then check an offered text against a known signature on a sampling of tuples through the doc.

This undoubtedly exists.
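(A minimal sketch of the idea, under assumptions of my own: normalise the text to fold away case, punctuation, and whitespace variants, hash word trigrams into a signature set, and compare signatures with Jaccard similarity. The normalisation rules and sample strings are illustrative, not a standard:)

```python
import hashlib
import re

def normalize(text: str) -> list:
    # Fold case, strip punctuation, collapse whitespace -- constant across
    # many surface variants of the same underlying text.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def trigram_signature(text: str) -> set:
    # Hash each overlapping word trigram to a small integer fingerprint.
    words = normalize(text)
    grams = (" ".join(words[i:i + 3]) for i in range(len(words) - 2))
    return {int(hashlib.sha256(g.encode()).hexdigest()[:12], 16) for g in grams}

def jaccard(a: set, b: set) -> float:
    # Overlap of two signatures: 1.0 = identical, 0.0 = disjoint.
    return len(a & b) / len(a | b) if a | b else 0.0

original = "The quick brown fox jumps over the lazy dog."
retyped = "the  quick brown Fox jumps over teh lazy dog"  # spacing, case, one typo

sim = jaccard(trigram_signature(original), trigram_signature(retyped))
print(sim)
```

A retyped variant with noise still shares most of its trigrams with the original, while an unrelated text shares almost none -- which is exactly the "same or highly-related document" test, without any external index.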

@dredmorbius They use techniques like this to detect plagiarism. You can compute something like a Bloom filter for a document and then use Hamming distance to compare. That can work well as long as one is not intentionally trying to defeat it.

Of course, that assumes raw text. Once you get into complex markup, the markup can change the meaning of the document without changing what a text extractor will see. And then there's higher-bandwidth media like images, audio, and video.
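(The Bloom-filter-plus-Hamming-distance comparison can be sketched like this -- a toy version with a single hash function and a small fixed-width filter, purely illustrative; real systems tune the width and hash count:)

```python
import hashlib
import re

M = 256  # width of the filter in bits

def bloom(text: str, m: int = M) -> int:
    # Set one bit per hashed word trigram (one hash function, for brevity).
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    bits = 0
    for i in range(len(words) - 2):
        gram = " ".join(words[i:i + 3]).encode()
        bits |= 1 << (int(hashlib.sha256(gram).hexdigest(), 16) % m)
    return bits

def hamming(a: int, b: int) -> int:
    # Number of differing bits between two filters: small = similar texts.
    return bin(a ^ b).count("1")

doc = "The quick brown fox jumps over the lazy dog."
near = "The quick brown fox jumped over the lazy dog."
far = "Completely unrelated sentence about bloom filters and hashing."

print(hamming(bloom(doc), bloom(near)), hamming(bloom(doc), bloom(far)))
```

The near-duplicate lands a short Hamming distance from the original; the unrelated text lands much farther away. As noted, this works on extracted raw text and is easy to defeat deliberately.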

@freakazoid And for anyone following this:

I'm not an expert, though I'm interested in the area.

I feel like I'm staggering drunk in the dark. Some of what I'm describing is Things I Have Known for Five Minutes Longer Than You (or a few days). Some longer.

This is ... remote from most work I've done, though I've been kicking around ideas for a few years, and know at least _some_ of what I'm talking about.

Informed input / corrections welcomed.

@dredmorbius @freakazoid I'll state publicly that I appreciate the thinking going on in this thread!

I don't have any additional input to give.
