The challenge of curation
Last year, right after the OLPC SF Community Summit 2011, I had the pleasure of attending Books in Browsers (BiB-II) at the Internet Archive. It was a plan made with SJ on the fly: take the Pathagar OPDS Book Server and put it on a 4-watt SheevaPlug. The very cool and awesome duo of Mary Lou Jepsen and John Ryan helped us present the unit (it was a last-minute thing; we weren't on the agenda). We even live-tested the unit. About 150 people hit the box and it held up. Load tests revealed that the plug server could serve 500 simultaneous users!
So, we had a self-contained book server that could run off a solar panel and arguably serve thousands of books in the middle of nowhere: a Wi-Fi bubble that serves up books to all within its reach. Heck, we even have a virtual machine, complete with Pathagar on it!
Where do we get the books? The Internet Archive, of course! With its 3 million plus books, it's a vast ocean to fish from. The bigger challenge is the fishing part.
- How do you curate content for your little Wi-Fi bubble?
- And once you do so, how do you pull it all together?
Raj Kumar (@rajbot) at the Internet Archive had the answer. They have been working on a script that pulls books and other media directly from the Archive. The script needs to be fed a bookmark file, which you can create after signing up at the Internet Archive. After a few conversations and a few trials, Raj pointed me to the very cool fetch_IA_item script.
To get rolling:
- Sign up for an account on the Internet Archive.
- Log in.
- Look for stuff on the Archive's pages, and when you find something interesting, bookmark it.
- Go to your "Patron Info" page, and grab the link for your bookmark file.
Go to https://github.com/rajbot/fetch_ia_item and get the script, either as a zip file or via git:
- git clone git://github.com/rajbot/fetch_ia_item.git
- In the fetch_IA_item folder, edit the fetch_IA_item.py file and replace the sample user id with your own archive.org id (mine is sverma, as in the example).
Run it. You'll need Python for this:
- python fetch_IA_item.py
- If you have books or other media in your bookmarks, the media and its metadata will start coming in. You can interrupt the download at any point (CTRL-C) and pick up where you left off by running the script again.
Works like a charm! Thank you Raj and the team at the Internet Archive. You guys rock!
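For the curious, here is roughly how "interrupt and pick up where you left off" can work in general. This is a minimal Python 3 sketch of the idea, not a copy of what fetch_IA_item.py does internally: the function name fetch_missing and the download callable are made up for illustration, and you would plug in your own downloader.

```python
import os


def fetch_missing(items, dest_dir, download):
    """Fetch each (filename, source) pair unless the file is already on disk.

    `download(source)` is a hypothetical callable that returns the file's
    bytes. Returns the list of filenames fetched during this run.
    """
    os.makedirs(dest_dir, exist_ok=True)
    fetched = []
    for name, source in items:
        target = os.path.join(dest_dir, name)
        if os.path.exists(target):
            continue  # already completed on an earlier run, so skip it
        # Write under a temporary name first, so an interrupted download
        # never leaves a half-written file under the final name.
        partial = target + ".part"
        with open(partial, "wb") as handle:
            handle.write(download(source))
        os.replace(partial, target)
        fetched.append(name)
    return fetched
```

Because finished files are skipped, running the same call again after a CTRL-C only fetches what is still missing. The ".part" trick is a design choice I made for the sketch, so that a file cut off halfway is never mistaken for a complete one.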
Next, we need to work on getting the metadata into an appropriate JSON or CSV format for Pathagar, but that's another project.
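As a starting point for that project, the CSV half might look something like the sketch below. Treat it purely as an illustration: the column names (identifier, title, creator, language) are my guesses, not Pathagar's actual import format, and the input is assumed to be one Python dict of metadata per book.

```python
import csv
import io

# Assumed column names for illustration only; the real Pathagar import
# format may well expect different fields.
FIELDS = ["identifier", "title", "creator", "language"]


def metadata_to_csv(records, fields=FIELDS):
    """Turn a list of metadata dicts into CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the chosen columns; missing values become empty cells.
        writer.writerow({field: record.get(field, "") for field in fields})
    return out.getvalue()
```

Calling metadata_to_csv on a list of per-book dicts gives back a string that can be written straight to a .csv file. Once the real field list is known, it can be passed in through the fields argument.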