Just for the record, sed has been kicking my butt for the past few hours.

In which we find actual errors in the Library of Congress classification.

Well, OK, typos. But I'm getting to the point where stuff jumps out.

There are 3,600-odd notes on how to classify items within the classification itself. And at KJ-KKZ2 144, Labour Courts and Procedures, we find:

"Class her works on courts of several jurisdictions". Rather than "Class _here_".

Not a biggie, but I found it.

The LCC law classification (K, generally) is one of the largest, most recent, and, from what I can tell, most idiosyncratic and potentially buggy parts of the classification.

Even just looking at consistency of tabs and spaces (indents ... _matter_ in the printed version) that was evident. A few other notations and conventions are also unusual.

@dredmorbius Is there some reason why you're using sed rather than a Popular Scripting Language™ like AWK, Ruby, Perl, Python, …?

@mathew I'm actually trying to reduce processing I'd been doing in awk to sed.

Sed ... _feels_ like it ought to work here. Part of this is just an exercise to see if I can wrap my head around it.

I'd managed a few bits earlier that had had me stumped for a bit.

It's also about developing a model in my head of the content I'm manipulating here, and seeing if that maps to sed's capabilities.

Otherwise, my Ruby / Perl / Python fu is poor.

@dredmorbius Well, if this is a one-time task, I guess the tool you know is the one to use. And presumably performance isn't too much of an issue.

@mathew I'm building a system to uniformly transform a set of converted text files.

PDFs converted to text, then having a bunch of quirks fixed so that they can be uniformly parsed.

The larger dataset is about 600k lines. This is a smaller set of < 10k lines. The point is to be able to repeat processing (and update if needed as sources change) in future.

@dredmorbius Nothing over almost 40 years of working with computers compares with getting sed to do something for you. It’s like unlocking secrets from the Deep Ancient Ones that can’t be named.

@tsturm It's really not that bad. For basic search-and-replace, sed is truly capable.

I'm into branched statements, conditional processing, and append-to-hold-space processing, which tends to get a bit hairier.
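
As an illustration of the conditional and hold-space style mentioned here, this is a minimal sketch that joins each blank-line-separated block of wrapped text into a single line. The sample text and the function names `wrapped_sample` and `join_paragraphs` are invented for the example; the actual scripts from the thread are not shown, so this is not necessarily how they work.

```shell
# Invented sample: two wrapped notes separated by a blank line.
wrapped_sample() {
  printf '%s\n' \
    'Class here works on' \
    'courts of several jurisdictions' \
    '' \
    'A second note'
}

# Collect non-blank lines in the hold space (H); at a blank line or at
# end of input, swap the collected block back in (x), turn its embedded
# newlines into spaces, and print it.
join_paragraphs() {
  sed -n '
    /./{
      H
      $!d
    }
    x
    s/^\n//
    s/\n/ /g
    p
  '
}

wrapped_sample | join_paragraphs
```

Runs of consecutive blank lines would each emit an empty output line in this sketch, so a real script would probably need an extra guard for that case.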

@brennen @dredmorbius It's true. Turing Tar Pits are the most fun you'll ever have programming... but they might not be the wisest or easiest way to achieve goals. :)

@brennen Sed's domain is repeated textual transformations.

That's precisely what I'm using it for. It's only that the source text is longer and more idiosyncratic than most.

A large part of the solution was realising that _most_ of the problems I was encountering came from the source assuming a paginated layout, which left various pagination-related artefacts in the text.


Scoping ranges that matched the page boundary indicators pretty much solved all that, then it was on to the special cases.
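
The range scoping described above could look roughly like the following sketch. The `==FOOTER==` and `==HEADER==` marker lines, the sample text, and the function names are all hypothetical; the real page boundary indicators in the converted files are not given in the thread.

```shell
# Invented sample: two "pages" with page-break residue between them.
# The ==FOOTER== / ==HEADER== marker lines are hypothetical.
paged_sample() {
  printf '%s\n' \
    'last line of page 1' \
    '==FOOTER==' \
    '12' \
    '==HEADER==' \
    'first line of page 2'
}

# The address range selects every line from a footer marker through the
# next header marker; d drops those lines, so the text on either side of
# the page break becomes contiguous.
strip_page_breaks() {
  sed '/^==FOOTER==$/,/^==HEADER==$/d'
}

paged_sample | strip_page_breaks
```

The same address range can also prefix a `{ ... }` group, so that other commands run only on lines inside the page-break region rather than deleting it outright.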

For the most part, sed's peculiarities were just what I needed here; I just had to be consciously aware of what they were and how they worked in relation to what I wanted and what I was processing.


@dredmorbius This is why I moved all that kind of work to my Filter.py tool; Python's regex is more current than sed's, and I can do more logic before and after processing a line. awk is fine, too, but also has an older regex system.

For a very long time I just wrote Perl, but I'm not masochistic enough to keep doing that.
