
does anyone have a good example / bit of code i can look over for using Spark (preferably Python) to iterate over a large number of HTTP calls?
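A minimal sketch of the Spark + Python pattern being asked about, assuming the work is a flat list of record URLs fetched with `requests`; the URL shape and app name are placeholders, not LoC's actual routes:

```python
from pyspark.sql import SparkSession
import requests

def fetch_partition(urls):
    # One Session per partition so TCP connections are reused
    # instead of re-opened for every record.
    session = requests.Session()
    for url in urls:
        try:
            resp = session.get(url, timeout=30)
            yield (url, resp.status_code, resp.text)
        except requests.RequestException as exc:
            yield (url, None, str(exc))

spark = SparkSession.builder.appName("marc-harvest").getOrCreate()

# Placeholder URLs; in practice these would come from the API routes.
urls = ["https://example.org/records/%d" % i for i in range(100000)]

# parallelize distributes the list; mapPartitions lets each executor
# keep one live session while it walks its slice of URLs.
results = (spark.sparkContext
           .parallelize(urls, numSlices=200)
           .mapPartitions(fetch_partition))

results.saveAsTextFile("marc_dump")
```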

@ekansa ... yeah :-/ i'm trying to get a bunch of MARC records outta LoC via 4 diff api routes

@cm_harlow And their terms of service are OK with that?

Sorry, I've never done anything with Spark. Sounds like you're doing a huge job though. I wonder if the LoC could just give you a big data dump?

@ekansa the terms of service for 1 of the services set a query time limit; the others are open.
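A small sketch of honoring that query time limit, assuming a simple one-request-per-interval throttle is enough; the exact interval is a guess to check against the ToS:

```python
import time
import requests

class ThrottledClient:
    """Serialize GETs to at most one per `min_interval` seconds."""

    def __init__(self, min_interval=1.0):  # assumed limit; check the ToS
        self.min_interval = min_interval
        self.last = 0.0
        self.session = requests.Session()

    def get(self, url):
        # Sleep just long enough to honor the interval, then fetch.
        wait = self.last + self.min_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last = time.monotonic()
        return self.session.get(url, timeout=30)
```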

their data dumps are not the representation i need, a lossy representation at that, and nearly a year out of date.

and they're not willing to generate new data dumps for me at this moment, so they point me to the various request options instead.

(but i could pay a vendor for what i need)

@cm_harlow Wow! Sheesh.

So are you doing some sort of analysis on all these MARC records? Or is this to build up your own data for retrieval services? Or something else?

Sorry, I can't help but I'm super intrigued by the scale of your project!

@ekansa heh, nw. i'm looking to retrieve a full dump of the Authorities to then serve up + manage in a few different ways - as a Git repo, as a ResourceSync source, as an IPFS repo.

@ekansa + perform a more granular conversion to RDF, enhance with reconciliation, + serve/publish via the same mechanisms.
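A hedged sketch of what a more granular MARC-to-RDF conversion could look like, assuming binary MARC authority files read with `pymarc` and triples built with `rdflib`; the SKOS predicates, file names, and URI pattern are illustrative choices, not LoC's published model:

```python
from pymarc import MARCReader
from rdflib import Graph, Literal, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
g = Graph()
g.bind("skos", SKOS)

with open("names.mrc", "rb") as fh:           # assumed local MARC dump
    for record in MARCReader(fh):
        if record is None:
            continue                          # skip unparseable records
        ids = record.get_fields("001")        # control number
        headings = record.get_fields("100")   # personal name heading
        if not ids or not headings:
            continue
        subject = URIRef("https://example.org/auth/" + ids[0].data.strip())
        for name in headings[0].get_subfields("a"):
            g.add((subject, SKOS.prefLabel, Literal(name)))

g.serialize(destination="names.ttl", format="turtle")
```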

@ekansa the management aspect means checking an Atom feed + other spots for notifications of records added or updated, then pulling in a less intensive fashion. But I need to get over the hump of a preliminary pull of all the data.
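A minimal sketch of that less intensive pull, assuming the Atom feed is readable with `feedparser` and each entry links to a fetchable record; the feed URL is a placeholder:

```python
import feedparser
import requests

FEED_URL = "https://example.org/updates.atom"  # placeholder feed

def pull_updates(seen_ids):
    """Yield (entry id, record body) for entries not seen before."""
    feed = feedparser.parse(FEED_URL)
    session = requests.Session()
    for entry in feed.entries:
        if entry.id in seen_ids:
            continue                 # already harvested on an earlier poll
        resp = session.get(entry.link, timeout=30)
        if resp.ok:
            seen_ids.add(entry.id)
            yield entry.id, resp.text
```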

@ekansa for sure.

i'm basically doing whatever i can to get these datasets better published + shared. i want to explore what forking of large auth. datasets could look like, but can't wait on LoC to move towards something other than Voyager / Z39.50 / SRU / MarkLogic for their data services

@cm_harlow Interesting. I've had trouble with really big Git repos before (memory issues), but you can probably divide it into a number of smaller repos.
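One way to do that split, sketched under the assumption that authority IDs look like `n79021383` and can be sharded on a short prefix so no single repo has to hold millions of files; the shard width is a tunable guess:

```python
from pathlib import Path

def shard_path(root, record_id, width=3):
    """Map e.g. n79021383 -> root/n79/n79021383.xml."""
    prefix = record_id[:width]           # shard key; width is a guess
    shard = Path(root) / prefix
    shard.mkdir(parents=True, exist_ok=True)
    return shard / (record_id + ".xml")
```

Each prefix directory could then be initialized as its own smaller repo.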

@ekansa i like what WOF is doing, find their kinda complete separation from other efforts in terms of data models a bit disconcerting though tbh

@cm_harlow wow! That sounds really awesome. For what resources? All the LoC or some?

@ekansa all the LoC Name + Subject Authorities - at least at first.

@cm_harlow That sounds super cool. Would be really interesting to link up with some other linked datasets, esp. gazetteers.

@ekansa for sure. geo data is one of the things that suffers the worst from the current LoC data dumps in RDF bc the conversion is lossy

@cm_harlow @ekansa This sounds awesome and potentially super useful. Are you doing this for private/internal use or public?

@ekansa @spellproof i got through maybe 40% of the NAF, pushing it to GitHub, but it's just taking forever with my dinky Python + requests-only scripts
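A sketch of one way to speed up a plain `requests`-only harvester without Spark, using a thread pool; the URLs are placeholders and the worker count is a guess to tune against the rate limits mentioned earlier:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch(url):
    # Each worker thread does a plain blocking GET.
    return url, requests.get(url, timeout=30).text

# Placeholder URLs; in practice the NAF record list.
urls = ["https://example.org/records/%d" % i for i in range(1000)]

with ThreadPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for future in as_completed(futures):
        url, body = future.result()
        print(url, len(body))        # or write into the shard layout above
```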

@cm_harlow @spellproof Sheesh! But you won't have to repeat that, will you? It's the initial batch, right?

@spellproof @ekansa i mean, maintenance will be interesting but not this big ass mountain to climb over

@cm_harlow @spellproof This will probably impact your normalization strategy. It would be a bummer to have to do a big update that impacts your whole repo.

@cm_harlow @ekansa I can imagine! Anything better is way over my head, but good luck!
