does anyone have a good example / bit of code i can look over for using Spark (preferably with Python) to iterate over a large number of HTTP calls?
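For anyone landing on this question later, here is a minimal sketch of one common pattern: put the list of URLs in an RDD and do the fetching inside `mapPartitions`, so each partition works through its own slice. The names `http_get`, `fetch_partition`, and `run_with_spark` are made up for illustration, the partition count is an arbitrary guess, and nothing here is specific to the LoC services.

```python
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Fetch one URL with the standard library and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_partition(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
) -> Iterator[Tuple[str, bytes, str]]:
    """Fetch every URL in one partition, yielding (url, body, error)."""
    for url in urls:
        try:
            yield (url, fetch(url), "")
        except Exception as exc:
            # Keep going; record the failure so it can be retried later.
            yield (url, b"", repr(exc))


def run_with_spark(urls, num_partitions=8):
    """Hypothetical wiring; requires pyspark to be installed."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("http-fetch-sketch").getOrCreate()
    rdd = spark.sparkContext.parallelize(urls, num_partitions)
    # collect() is only sensible for small test runs; a real job would
    # write the results out to storage instead.
    return rdd.mapPartitions(fetch_partition).collect()
```

Since `fetch_partition` is a plain generator, it can be tried on a short list of URLs without Spark at all. The number of partitions running at once roughly caps how many requests are in flight, which matters if the remote service has usage limits.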

@ekansa ... yeah :-/ i'm trying to get a bunch of MARC records outta LoC via 4 different API routes

@cm_harlow And their terms of service are OK with that?

Sorry, I've never done anything with Spark. Sounds like you're doing a huge job though. I wonder if the LoC could just give you a big data dump?

@ekansa terms of service for 1 of the services have a query time limit, others are open.
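If that limit means spacing requests out over time (an assumption; the actual terms are not quoted here), a small throttle that enforces a minimum gap between calls is one way to stay under it. `Throttle` and `min_interval` are hypothetical names, and the interval has to come from the service's real terms.

```python
import time


class Throttle:
    """Block so that successive wait() calls are at least min_interval apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request to the limited service would be the intended use; the open services could skip it.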

their data dumps are not the representation i need, a lossy representation at that, and nearly a year out of date.

and they're not willing to generate new data dumps for me at this moment; they just send me to the various request options.

(but i could pay a vendor for what i need)

@cm_harlow Wow! Sheesh.

So are you doing some sort of analysis on all these MARC records? Or is this to build up your own data for retrieval services? Or something else?

Sorry, I can't help but I'm super intrigued by the scale of your project!

@ekansa heh, nw. i'm looking to retrieve a full dump of the Authorities to then serve up + manage in a few different ways - as a Git repo, as a ResourceSync source, as an IPFS repo.

@ekansa + perform a more granular conversion to RDF, enhance with reconciliation, + serve/publish in same mechanisms.

@ekansa the management aspect means checking on an Atom feed + other spots for notification of records added or updated, then pulling in a less intensive fashion. But I need to get over the hump of a preliminary pull of all the data.
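The feed-checking part can be sketched with the standard library alone: parse the Atom document and keep the entries whose `updated` timestamp is newer than the last sync. `entries_updated_since` is a hypothetical helper, and the sketch assumes the feed uses the standard Atom namespace with RFC 3339 timestamps; the real feed's layout is not shown in this thread.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_ts(text):
    """Parse an RFC 3339 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def entries_updated_since(feed_xml, since):
    """Return (entry id, updated time) for entries changed after `since`."""
    root = ET.fromstring(feed_xml)
    changed = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if entry_id is None or updated is None:
            continue  # skip malformed entries rather than failing the run
        stamp = parse_ts(updated)
        if stamp > since:
            changed.append((entry_id.strip(), stamp))
    return changed
```

The `since` value needs to be timezone-aware (for example `datetime(2016, 2, 1, tzinfo=timezone.utc)`), and only the returned ids would then need re-fetching.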

@ekansa for sure.

i'm basically doing whatever I can to get these datasets better published + shared. I want to explore what forking of large auth. datasets could look like but can't wait on LoC to move towards something other than Voyager / Z39.50 / SRU / MarkLogic for their data services

@cm_harlow Interesting. I've had trouble with really big Git repos before (memory issues), but you can probably divide into a number of smaller repos.
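One way to picture the split into smaller repos: hash each record identifier and use the hash to pick a shard, so a given record always lands in the same repo. The function names, the `authorities` prefix, and the shard count of 16 are all illustrative assumptions.

```python
import hashlib


def shard_for(record_id, num_shards=16):
    """Map a record identifier to a stable shard number in [0, num_shards)."""
    digest = hashlib.sha1(record_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def repo_name(record_id, prefix="authorities", num_shards=16):
    """Name of the (hypothetical) repo that would hold this record."""
    return f"{prefix}-{shard_for(record_id, num_shards):02d}"
```

Changing `num_shards` later reshuffles most records between repos, so the count is worth settling before the first big push.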

@ekansa i like what WOF is doing, but find their kinda complete separation from other efforts in terms of data models a bit disconcerting tbh

@cm_harlow wow! That sounds really awesome. For what resources? All the LoC or some?

@ekansa all the LoC Name + Subject Authorities - at least at first.

@cm_harlow That sounds super cool. Would be really interesting to link up with some other linked datasets, esp. gazetteers.

@ekansa for sure. geo data is one of the things that suffers the worst from the current LoC data dumps in RDF bc the conversion is lossy

@cm_harlow @ekansa This sounds awesome and potentially super useful. Are you doing this for private/internal use or public?

@ekansa @spellproof i got through maybe 40% of the NAF pushing it to GitHub but it's just taking forever with my dinky python + requests only scripts
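Short of a cluster, a sequential fetch loop can often be sped up by running the requests on a thread pool, since the time is mostly spent waiting on the network. This is a sketch using only the standard library (`urllib` stands in for `requests`); `fetch_all` and the worker count are assumptions to tune against whatever the remote services allow.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def http_get(url, timeout=30.0):
    """Fetch one URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_all(urls, fetch=http_get, max_workers=8):
    """Fetch URLs concurrently; results come back in input order."""

    def safe(url):
        try:
            return (url, fetch(url), "")
        except Exception as exc:
            # Record the failure instead of aborting the whole batch.
            return (url, b"", repr(exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, urls))
```

Entries with a non-empty error string can be collected and fed back in for a retry pass.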

@cm_harlow @spellproof Sheesh! But you won't have to repeat that will you? It's the initial batch right?

@spellproof @ekansa i mean, maintenance will be interesting but not this big ass mountain to climb over

@cm_harlow @spellproof This will probably impact your normalization strategy. It would be a bummer to have to do a big update that impacts your whole repo.

@cm_harlow @ekansa I can imagine! Anything better is way over my head, but good luck!

