Archive of April 2005
Wow, blast from the past
Remember Troublefunk? (Awesome MP3-age courtesy of mastermix.org.)
15:41 | 0 Comments
PDF metadata geekery
Recently a friend called with an interesting question (interesting to me, at any rate). It seems that, as an academician, Peter has gathered a large and unwieldy personal collection of assorted academic papers in PDF format on his hard drive. So he’s wondering whether I might have any ideas as to a database-driven way to organize and navigate through this collection, ideally based on bibliographic attributes and other metadata such as keywords, cross-references, etc.
Well, it turns out this inquiry was almost spookily well-timed. Just a few days earlier, while doing some background research on vocabulary engineering, I had come across a rather interesting paper in Information Studies research, “Why can’t I manage academic papers like MP3s?” by James Howison & Abby Goodrum. This paper makes two key observations about different treatments of metadata:
Firstly, digital music metadata is standardized and moves with the content file, while academic metadata is not and does not. Secondly digital music metadata lookup services are collaborative and automate the movement from a digital file to the appropriate metadata, while academic metadata services do not.
There’s a pretty good point there, I thought. So, when Peter showed up with the same problem described in the paper, I began to think I might be on to something. I talked to a couple other researchers, and I found that this issue of burgeoning big blobs of PDF files is pretty common among them. Not an enormous market, perhaps, but worth looking at a solution for.
After a little more digging, I found out that Adobe publishes a standard called XMP for embedding metadata into PDFs (and certain other file formats) as an XML stream. I began to have visions of developing some sort of easy library manager, with the ability to edit XMP metadata, and organize the library according to this information—in much the way MP3 “jukeboxes” like iTunes do with ID3 tag info. Unfortunately, I soon found out I’m not the first to have ventured down this path, and it’s not nearly as easy as I had hoped, at least not without purchasing rather pricey licenses to commercial SDKs from Adobe.
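For the curious, here is a rough sketch of what reading that embedded metadata can look like without any commercial SDK. It relies on the fact that an XMP packet is wrapped in `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions and uses Dublin Core (`dc:`) properties for things like title and subject. It assumes the metadata stream is stored uncompressed and unencrypted in the file, which is not guaranteed for every PDF. The function names `extract_xmp` and `dublin_core_values` are my own inventions for illustration, not part of any Adobe API, and this only reads metadata; writing it back safely is the hard part.

```python
import re
import xml.etree.ElementTree as ET

# An XMP packet sits between <?xpacket begin=...?> and <?xpacket end=...?>.
XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

# Standard namespace URIs for Dublin Core and RDF.
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def extract_xmp(data: bytes):
    """Return the body of the first XMP packet in raw file bytes, or None.

    Assumption: the packet is stored uncompressed, so a plain byte scan finds it.
    """
    match = XPACKET.search(data)
    return match.group(1) if match else None


def dublin_core_values(xmp: bytes, field: str):
    """Collect the rdf:li values under a dc:<field> element.

    `field` is a Dublin Core name such as "title", "creator" or "subject".
    """
    root = ET.fromstring(xmp.strip())
    values = []
    for elem in root.iter(f"{{{DC}}}{field}"):
        for li in elem.iter(f"{{{RDF}}}li"):
            if li.text:
                values.append(li.text.strip())
    return values
```

To try it on a real file, read the PDF with `open(path, "rb").read()`, pass the bytes to `extract_xmp`, and, if that returns something other than `None`, ask `dublin_core_values` for `"title"` or `"subject"`.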
The good news is that it turns out Michael McCracken has already come up with BibDesk, which is about as close to the thing I had in mind to develop as it’s possible to get right now. While it has to connect metadata to document files by linking rather than embedding, it is able to directly import BibTeX .bib files. This cuts down on most of the manual bibliographic data entry one might have to do, as many of your larger online citation databases have some BibTeX export utility. Even where these don’t, many of them can be automatically linked back to a personal library at Citeulike and you can export BibTeX from there. If I were a perfesser with a big mess o’ papers clutterin’ my hard drive, BibDesk is what I’d be using to keep track of them. I may just start using it myself, as I do tend to download and read a fair amount of research in Computer and Information Sciences, just for my own amusement.
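To give a feel for the "linking rather than embedding" approach, here is a minimal sketch of reading a .bib file and building a keyword index over it. This is not how BibDesk works internally (I haven't looked). It is only an illustration, and it handles just the simplest BibTeX layout: one field per line, values in braces or quotes with no nesting, and a closing brace on its own line. The `file` and `keywords` field names, the sample path, and the sample keywords are assumptions made up for the example.

```python
import re
from collections import defaultdict

# Simplified patterns: NOT a full BibTeX grammar (no nested braces, no @string).
ENTRY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
FIELD = re.compile(r"(\w+)\s*=\s*[{\"](.*?)[}\"]\s*,?\s*$", re.MULTILINE)

# Hypothetical entry: the file path and keywords are invented for illustration.
SAMPLE_BIB = """
@article{howison_goodrum,
  author = {James Howison and Abby Goodrum},
  title = {Why can't I manage academic papers like MP3s?},
  keywords = {metadata, pdf},
  file = {papers/howison_goodrum.pdf}
}
"""


def parse_bibtex(text):
    """Map each citation key to a dict of its lower-cased field names and values."""
    entries = {}
    for kind, key, body in ENTRY.findall(text):
        fields = {name.lower(): value.strip() for name, value in FIELD.findall(body)}
        fields["type"] = kind.lower()
        entries[key] = fields
    return entries


def index_by_keyword(entries):
    """Map each keyword to the citation keys of the entries that carry it."""
    index = defaultdict(list)
    for key, fields in entries.items():
        for keyword in fields.get("keywords", "").split(","):
            keyword = keyword.strip().lower()
            if keyword:
                index[keyword].append(key)
    return dict(index)
```

Calling `index_by_keyword(parse_bibtex(SAMPLE_BIB))` gives a dictionary from keyword to citation keys, and each entry's `file` field points back at the PDF on disk. The metadata lives beside the document rather than inside it, so the link breaks if the file is moved or renamed.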
Now as it turns out, Peter was actually looking for something that would not only organize his papers, but allow him to share them with colleagues. For this, we ended up installing Document Database on an extra box he had in his office. This looks like a pretty good tool for managing a shared document library in collaboration with a group of people. I would like to see it add built-in security enforcement (it currently relies on Basic HTTP Authentication) and more facilities for import and export of BibTeX files. Right now, it only appears able to export one citation at a time, for a limited number of fields, and does not import at all, so new entries have to be entered manually. Citeulike might be getting close to something like this, now that it has added file uploading, but at the moment there’s no sharing of uploaded files with other people. Since Richard is handling it this way owing to copyright concerns, I’m guessing this is unlikely to change, but you never know. He might be able to work something out for files that can be verified to have more relaxed restrictions.
Anyway, that’s about all the work I did on this subject. I just wanted to write it up here to get it all in one place, and share any results of this obsession that may be of help or interest to others. Although I no longer plan to code my own PDF taggerator, I remain interested in that XMP metadata spec. Since I have access to Adobe Acrobat at work, I’m at least going to make sure that any PDFs I author will have carefully written embedded metadata. The whole thing has also sparked a larger interest in RDF and the Semantic Web in general, about which I hope to write more at a later date—those thoughts need a little more time to incubate, I think.
09:36 | 0 Comments
Genius
Another Word for Nerd presents the best music video ever. I’ll let the link tell the story, but I will add my recommendation that you download the video, and watch it with the sound turned up. Funny, brilliant, and even kind of weirdly touching.
13:17 | 0 Comments