Does anyone have a good example / bit of code I can look over for using Spark + (preferably) Python to iterate over a large number of HTTP calls?
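One common way to do what's being asked here is to parallelize the URL list as a Spark RDD and fetch each partition with `mapPartitions`. This is a minimal sketch, assuming PySpark is installed, the URLs need no auth, and the names (`fetch_partition`, the example LoC URL, `numSlices=32`) are illustrative, not from the thread:

```python
import urllib.request

def fetch_partition(urls, urlopen=urllib.request.urlopen):
    """Fetch every URL in one partition; yields (url, body) pairs.

    `urlopen` is injectable so the function can be tested without a network.
    Failures yield (url, None) so they can be collected for a retry pass.
    """
    for url in urls:
        try:
            with urlopen(url, timeout=30) as resp:
                yield url, resp.read().decode("utf-8")
        except Exception:
            yield url, None

if __name__ == "__main__":
    # Spark driver side: fan the calls out across executors.
    from pyspark.sql import SparkSession  # assumes pyspark is available
    spark = SparkSession.builder.appName("authority-pull").getOrCreate()
    urls = ["https://id.loc.gov/authorities/names/n79021164.json"]  # hypothetical
    results = (spark.sparkContext
               .parallelize(urls, numSlices=32)  # rough concurrency knob
               .mapPartitions(fetch_partition)
               .collect())
```

A `time.sleep()` inside the loop is the simplest way to stay under a per-service query-rate limit like the one mentioned below, since each partition runs its fetches sequentially.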
@cm_harlow And their terms of service are OK with that?
Sorry, I've never done anything with Spark. Sounds like you're doing a huge job though. I wonder if the LoC could just give you a big data dump?
@ekansa terms of service for one of the services impose a query-time limit; the others are open.
their data dumps aren't the representation I need (a lossy one, at that), and they're nearly a year out of date.
and they're not willing to generate new data dumps for me at the moment; they point me to the various request options instead.
(but i could pay a vendor for what i need)
@cm_harlow Wow! Sheesh.
So are you doing some sort of analysis on all these MARC records? Or is this to build up your own data for retrieval services? Or something else?
Sorry, I can't help but I'm super intrigued by the scale of your project!
@ekansa heh, no worries. I'm looking to retrieve a full dump of the Authorities, then serve it up + manage it in a few different ways: as a Git repo, as a ResourceSync source, and as an IPFS repo.
@ekansa + perform a more granular conversion to RDF, enhance it with reconciliation, + serve/publish it through the same mechanisms.
@ekansa the management aspect means checking an Atom feed + other spots for notifications of records added or updated, then pulling in a less intensive fashion. But I need to get over the hump of a preliminary pull of all the data.
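The incremental-update step described here can be sketched with the standard library alone: parse the Atom feed and compare each entry's `updated` stamp against what was last seen, so only changed records get re-fetched. The function name and the idea of keeping a `seen` dict are assumptions for illustration; the thread doesn't specify the mechanism:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace (RFC 4287)

def changed_entries(feed_xml, seen):
    """Return (id, updated) pairs for entries new or newer than `seen`.

    `seen` maps entry id -> last-seen `updated` value, e.g. loaded from
    a local state file between polling runs.
    """
    root = ET.fromstring(feed_xml)
    out = []
    for entry in root.findall(ATOM + "entry"):
        eid = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if seen.get(eid) != updated:
            out.append((eid, updated))
    return out
```

Each polling run would then fetch only the returned ids, which is the "pulling in a less intensive fashion" part.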
@ekansa for sure.
I'm basically doing whatever I can to get these datasets better published + shared. I want to explore what forking of large authority datasets could look like, but I can't wait on LoC to move toward something other than Voyager / Z39.50 / SRU / MarkLogic for their data services.
@cm_harlow Interesting. I've had trouble with really big Git repos before (memory issues), but you can probably divide into a number of smaller repos.
@ekansa i like what WOF is doing, though tbh I find their near-complete separation from other efforts, in terms of data models, a bit disconcerting.
@cm_harlow That sounds super cool. Would be really interesting to link up with some other linked datasets, esp. gazetteers.
@ekansa for sure. geo data is one of the things that suffers worst in the current LoC RDF data dumps, because the conversion is lossy.