
Just for the record, sed has been kicking my butt for the past few hours.

In which we find actual errors in the Library of Congress classification.

Well, ok, typos. But I'm getting to the point where stuff jumps out.

There are 3,600-odd notes on how to classify items within the classification itself. And at KJ-KKZ2 144, Labour Courts and Procedures, we find:

"Class her works on courts of several jurisdictions". Rather than "Class _here_".

Not a biggie, but I found it.

The LCC law classification (K, generally) is one of the largest, most recent, and, from what I can tell, most idiosyncratic and potentially buggy parts of the classification.

Even just looking at the consistency of tabs and spaces (indents ... _matter_ in the printed version), that was evident. A few other notations and conventions are also unusual.

@dredmorbius Is there some reason why you're using sed rather than a Popular Scripting Language™ like AWK, Ruby, Perl, Python, …?

@mathew I'm actually trying to reduce processing I'd been doing in awk to sed.

Sed ... _feels_ like it ought to work here. Part of this is just an exercise to see if I can wrap my head around it.

I'd managed a few bits earlier that had had me stumped for a while.

It's also about developing a model in my head of the content I'm manipulating here, and seeing whether that maps to sed's capabilities.

Otherwise, my Ruby / Perl / Python fu are poor.

@kensanata @mathew OK, I seem to have got it.

The advantage is that the logic is now more streamlined, and more appropriate to (and hopefully, resilient to changes in) the source file structure.

Now to see if I can spot any issues that have crept in....

65 lines of executable sed (nonblank / non-comment lines) to manage 13,000 source lines of text. That's the power.
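
(That count, for what it's worth, is just a grep away; the script name below is made up.)

    # count non-blank, non-comment lines in the sed script
    grep -cvE '^[[:space:]]*(#|$)' transform.sed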

@kensanata @mathew Issues ... spotted ;-)

(I'm doing some block-of-text level processing. Duplicate outputs are sneaking in.)
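
(Not necessarily what's biting me here, but the classic way sed duplicates output is an explicit p on top of the default auto-print:)

    sed 'p' somefile.txt      # every line comes out twice: once from p, once from auto-print
    sed -n 'p' somefile.txt   # -n suppresses auto-print, so each line appears once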

@dredmorbius Are you using some sort of automated tests to guard against regressions introduced by changes to your code?
E.g., some sample input and output files that should give the same expected outputs?

@FiXato The test case *is* the data.

These are converted "FreeLCC" Library of Congress Classification files. Originals are PDF. I've converted them to text, and needed to remove page-based artefacts, unwrap lines, get rid of a few bits, and tidy up some oddness within the source file(s).

So: a general process loop, sorting when I was at the end of a page-sized chunk (looking for form-feed and footer patterns), and dealing with a few special cases that cropped up, to avoid hand-edits.
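
(For the unwrap-lines part, the shape I have in mind is the usual N/P/D join loop. A sketch, assuming GNU sed, and assuming continuation lines start with whitespace, which may not match the real files:)

    # join wrapped lines onto the line they continue
    :join
    $!N
    /\n[[:space:]]/ {
      s/\n[[:space:]]*/ /
      b join
    }
    P
    D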

@FiXato TL;DR: there's no real spec.

Running some diffs / sdiffs of the raw text dump and the reprocessed file(s) gives me most of the verification. Stuff like line counts, element counts, and looking for wildly different output. It's looking good.
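
(Those checks are one-liners; the file names here are placeholders.)

    wc -l raw.txt processed.txt            # gross line counts
    grep -c '^K' raw.txt processed.txt     # example element count: lines opening with a class letter
    sdiff -s raw.txt processed.txt | less  # show only the lines that differ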

Source docs may be updated on an annual or less frequent basis.

@dredmorbius But you are currently verifying, by hand, how well they converted, right? That was the automation part I meant. If you already do that, please ignore :)

@FiXato There's ... no straightforward way of doing that that I can think of, though if you have ideas, I'm all ears.

The document basically lists specific Library of Congress classifications (divisions), or ranges, along with a description.

There is a CLASS ([A-HJ-NP-VZ]), a subclass ([A-HJ-NP-VZ][A-Z]{1,2}), a division ([0-9]{1,4}), and a "Cutter" classification ([A-Z][0-9]{1,3}), followed by a description (text, which may include non-Roman characters, e.g. Arabic, Hebrew, Cyrillic, Chinese).
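
(Strung together, that's roughly greppable. This is just my reading of the fields above; the dot before the Cutter and the file name are assumptions, and ranges like KJ-KKZ2 would need an extra alternation.)

    # lines opening with class/subclass letters, a division, and an optional Cutter
    grep -E '^[A-HJ-NP-VZ][A-Z]{0,2}[0-9]{1,4}(\.[A-Z][0-9]{1,3})?[[:space:]]' lcc.txt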

@FiXato The reason I'm doing this is because there's no accessible machine-readable form of this, so spot validation is ... difficult.

Though kicking out random values and confirming them can help.
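
(Something as simple as pulling a random sample to eyeball against the PDF; the file name is a placeholder.)

    # ten random lines from the processed file, for manual spot-checking
    shuf -n 10 processed.txt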

Again, if you have a suggested method of vetting a process like this, I'd appreciate it.

@dredmorbius I guess at the very least you could save the source and output files once you are satisfied with this batch, and then when you later edit your script again, you can run it on that same set of source files and diff the new output files against the saved output files.
Then you'll at least know your changes won't have unintentionally broken a prefix conversion rule.
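
(Roughly this shape, I mean; script and file names are made up:)

    # regression check: rerun on a frozen input, compare against a saved known-good output
    sed -f transform.sed sample-input.txt > /tmp/new-output.txt
    diff -u known-good-output.txt /tmp/new-output.txt && echo "no regressions"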

@dredmorbius Well, if this is a one-time task, I guess the tool you know is the one to use. And presumably performance isn't too much of an issue.

@mathew I'm building a system to uniformly transform a set of converted text files.

PDFs converted to text, then having a bunch of quirks fixed so that they can be uniformly parsed.

The larger dataset is about 600k lines. This is a smaller set of < 10k lines. The point is to be able to repeat processing (and update if needed as sources change) in future.
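
(In practice that just means the whole run has to be a one-command affair; roughly this, with made-up paths:)

    # reprocess every converted text file with the same sed script
    mkdir -p cleaned
    for f in text/*.txt; do
        sed -f transform.sed "$f" > "cleaned/$(basename "$f")"
    done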

@dredmorbius Nothing in almost 40 years of working with computers compares with getting sed to do something for you. It’s like unlocking secrets from the Deep Ancient Ones that can’t be named.

@tsturm It's really not that bad. For basic search-and-replace, sed is truly capable.
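
(The bread-and-butter case being, say, stripping trailing whitespace; file names invented.)

    sed 's/[[:space:]]*$//' input.txt > output.txt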

I'm into branched statements, conditional processing, and append-to-hold-space processing, which tends to get a bit hairier.
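
(The append-to-hold-space pattern I mean looks roughly like this. The FOOTER marker is a stand-in for the real page-footer pattern, z is a GNU sed extension, it needs to run with sed -n so auto-print doesn't double everything up, and a real script would also have to flush a trailing partial page at end-of-input.)

    /^FOOTER/ {
      # swap the accumulated page into pattern space
      x
      # drop the leading newline the first H left behind
      s/^\n//
      # example block-level edit: squeeze runs of blank lines
      s/\n\n*/\n/g
      # emit the processed page
      p
      # clear pattern space, then park the empty string back in hold space
      z
      x
      # discard the footer line itself
      d
    }
    # every other line: append to the growing page
    H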

@brennen @dredmorbius It's true. Turing Tar Pits are the most fun you'll ever have programming... but they might not be the wisest or easiest way to achieve goals. :)

@brennen Sed's domain is repeated textual transformations.

That's precisely what I'm using it for. It's only that the source text is longer and more idiosyncratic than most.

Realising that _most_ of the problems I was encountering stemmed from the source assuming a paginated layout, with various pagination-related artefacts, was a large part of solving what I was running into.

@tsturm @brennen Scoping ranges that matched the page boundary indicators pretty much solved all that; then it was on to the special cases.

For the most part, sed's peculiarities were just what I needed here; I just had to be consciously aware of what they were and how they worked, relative to what I wanted and what I was processing.
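
(By scoping I mean address ranges of the /start/,/end/ form; the patterns here are stand-ins for the real form-feed and footer markers, and \f is a GNU sed escape.)

    # act only on the lines between a form feed and the next footer line
    /\f/,/^ *Page [0-9][0-9]* *$/ {
      # the page chrome lives in this range; e.g. just drop it
      d
    }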

@tsturm @dredmorbius This is why I moved all that kind of work to my Filter.py tool; Python's regex is more current than sed's, and I can do more logic before and after processing a line. awk is fine, too, but also has an older regex system.

For a very long time I just wrote perl, but I'm not masochistic enough to keep doing that.
