Using BagIt in 2018

One of my more popular posts to this blog has been my 2016 round-up of BagIt, the Library of Congress’ seminal file packaging specification/software library. My overall explanation for what BagIt is, why it’s so important, the still-scattered state of documentation, and the need for a roundup of implementations for practical use all still stand… but I’ve realized lately that this post/topic could use a revisit, for a couple of reasons:

1) A year on, I’ve done a lot more interaction on GitHub and with open source software, and I regret my general tone when discussing the need for better BagIt documentation. One of the beautiful things about open source projects (and BagIt particularly, since the LoC hosts all the code for the BagIt libraries and several of its implementations on GitHub, which is *made* for collaboration) is the opportunity for direct, constructive feedback. I should have raised my problems with unclear documentation as an issue on GitHub (looky here, just as I did while preparing this post), or at least posed my confusion as a question/concern to be improved, rather than as a complaint “behind the backs” of the developers! Etiquette is important, and I will do better at remembering that digital preservation is not an unfeeling collection of tools and tech – there are people behind every line of code and every social media post (OK besides the twitter bots but you know what I mean).

2) Software changes! It updates! That’s the whole point! And instructions that worked even a year or two ago may no longer work in the most contemporary environments. To that end, there have been some changes in macOS systems in particular that make me want to create new installation instructions (particularly for bagit-python) to help people avoid headaches.

So check out my previous post for why BagIt’s great – and then look below for a new roundup of how and why to use its various interfaces and implementations in 2018! (yeah I know it’s still 2017, but as much as digital preservation is about constant updating I’d like to future-proof this thing by *at least* two weeks, ya know?)

1. Bagger (GUI)

It’s Bagger! Still a nice intuitive GUI interface with big honking buttons for the basic tasks of bagging (creation from multiple files or bagging a directory in place, adding metadata, verification/validation). Still probably the best/most intuitive implementation for novice users. And the LoC GitHub repo for Bagger now has specific first-time installation/run instructions for both Windows and Mac. Beautiful!

2. bagit-java (library + CLI)

The LoC’s bagit-java 5.x library can be incorporated into any scripts or applications written in Java (such as the two GUI implementations elsewhere on this list). It can not, however, be interfaced with as a stand-alone command line utility. For that, you can still install and use bagit-java version 4.x, even though that version has obviously been surpassed and is not being actively developed. For installing and using bagit-java via CLI, you can use Homebrew (note that you will need Java installed as well):

$ brew install bagit

which installs bagit-java v4.12.3. Documentation for using bagit-java can be found in the utility’s help page, invoked with:

$ bagit –help

Just a quick note: the –help page incorrectly refers to the command to invoke bagit-java. The help page usage example says to use “$ bag <operation> [operation arguments]”, but the correct syntax is in fact “$ bagit <operation> [operation arguments]” ! (per my question on GitHub about this, apparently this problem is hard-coded and would require recompiling the Java source rather than just tweaking a doc, so since bagit-java CLI isn’t actively maintained, no fix is forthcoming)

Screen Shot 2017-12-15 at 11.54.15 AM.png

3. bagit-python (library + CLI)

So, this section is really less an update on bagit-python and more an update on python itself. Bagit-python can still be used either as a library to integrate into scripts and applications written in python, or as stand-alone command line utility. Your preference for using bagit-java or bagit-python in the CLI could be decided by looking at both utilities’ help pages – but I would actually reverse my previous recommendation and go back to generally recommending/using bagit-java, as the bagit-python CLI appears pretty exclusively aimed at creating bags-in-place and has fewer commands for verification, splitting, flexible bag creation and updating, etc. In either case, if you are interested in using/installing bagit-python, changes in recent macOS versions have meant that my previous instructions created more headaches than intended.

(Thanks to the brave MIAP students in Video Preservation who discovered and tried to deal with these inconsistencies!!)

So, for explanation: starting with OSX 10.11 (El Capitan), Apple introduced a feature called System Integrity Protection, nominally to keep unverified or malevolent applications downloaded from the internet from messing with critical OS-installed system software. What this means is, without futzing around a lot with permissions (which is not a great idea for a novice user), using a package manager like Homebrew winds up with some software in the OS-controlled “/usr” directory and its subfolders, and some software in the user- or package-manager-controlled “/usr/local” directory and its subfolders.

My previous instructions, which directed people to mix the default macOS-installed version of Python with the user-installed versions of pip (python’s package manager) and bagit-python, generated a whole bunch of permissions issues.

The solution? Stay away from the macOS python altogether and install all components with a package manager to keep the installation contained within “/usr/local”.

So, assuming you have Homebrew installed:

  1. $ brew install python

    This will install Homebrew’s Python 2.x package (currently 2.7.14), which includes the pip package manager by default (macOS’ Python package does *not* include pip by default). Note however! Since your Mac already came with a python installation (at /usr/bin/python), Homebrew renames its versions “python2” and “pip2” to avoid confusion/overwriting. (so its commands/binaries live at /usr/local/bin/python2 and /pip2)

  2. $ pip2 install bagit

    THAT’S IT. “sudo” shouldn’t be necessary. You can invoke bagit-python commands with

$ bagit.py [path/to/directory]

just as before. Check out the help page with the “–help” flag for more info.

(If you already tried to install bagit-python with the previous instructions, you will likely need to do some cleanup in the /usr folder to clear everything out and stop throwing errors. If you need help or advice doing this, feel free to get in touch!!)

4. BagIt for Ruby (library + CLI)

The “bagit-ruby” implementation has been expanded and documented since last year! If you are interested in including a BagIt module in a Ruby application/script, or using this version via the command line, you’ll first need to install Ruby:

$ brew install ruby

Which will include Ruby’s built-in package manager, gem.

$ gem install bagit validatable

Note that you can’t install this Ruby package and the Homebrew package of bagit-java at the same time, as you’ll get a collision with them named the same thing in /usr/local/bin. Once downloaded/installed with gem, the BagIt for Ruby CLI is documented at:

$ bagit –help

….but it’s real basic, even compared to the CLI for bagit-python. This particular implementation is probably most ideal for its library and incorporation in Ruby scripts/apps, not necessarily for direct command line interfacing.

5. Exactly (GUI)

Not much more to say about AVPreserve’s packaging/transfer application since last year – but the combined ability to not just bag, but deliver or receive directories over standard network protocols still make it a great option for those on Mac or Windows and in need of a simple workflow that combines two major ingest steps (bagging and delivery) into one quick and easy tool.

6. bagger-js (experimental library + web app)

Likewise, the LoC’s BaggerJS library/app could serve as both bagging and delivery system, via a web browser interface instead of a stand-alone, downloaded app. It’s basically “bagit-javascript” – that is, the BagIt library written in JavaScript (which is a web programming language entirely separate from Java). I assume it’s referred to as “bagger-js” because in the LoC’s naming system, “bagger” implies a GUI, whereas “bagit” is just the underlying library or CLI.

Bagger-js is still referred to in the LoC GitHub repo as “experimental”, so the library and accompanying demo web interface (which can bag a local directory and send it to a remote server compliant with Amazon’s s3 protocol) are not production-ready like Bagger or the other BagIt libraries/interfaces. But, again, all the work that they’ve done so far is right there and available to adapt/incorporate into your own JavaScript/web app projects!

7. other apps

Of course, there are likely a number of applications or other pieces of software that incorporate BagIt as one piece or microservice of a larger workflow/system. Archivematica’s a major one that I’m aware of. Maybe you have another! Feel free to let me know what I’ve missed.

 

Advertisements

Classroom Access to Interactive DVDs

Normally my focus as MIAP Technician is on classroom support for courses in the MIAP  M.A. curriculum – but, as a staff member of the wider NYU Cinema Studies department, there are occasionally cases where I can assist non-MIAP Cinema Studies courses with a need for archival or legacy equipment.

That was the case recently with a Fall 2017 course called “Interactive Cinema & New Media”, which challenged the skills I learned in MIAP regarding disk imaging, emulation, and legacy computing, and provides, I think, an interesting case study regarding ongoing access to multimedia software-based works from the ’90s and early 2000s.

In this project I worked closely with Marina Hassapopoulou, the Visiting Assistant Professor teaching the course; Ina Cajulis, recently hired as the department’s Special Events/Study Center Coordinator (also a Cinema Studies M.A. graduate who took several MIAP classes, including Handling Complex Media, the course most focused on interactive moving image works); and Cathy Holter, Cinema Studies Technical Coordinator.

Last fall when Marina was teaching “Interactive Cinema”, I worked briefly with her request to give students access to a multimedia work by Toni Dove, called “Sally or the Bubble Burst”. “Sally” is an interactive DVD-ROM in which users can navigate various menus, watch videos, and interact (sometimes via the keyboard, sometimes using audio input and speech recognition software) with a number of characters, primarily Sally Rand, a burlesque dancer from the mid-20th century. Because it was created/released in 2003, “Sally” has some unique technical requirements: namely, a PowerPC Mac running either OS 9.1-9.2 or OSX 10.2-10.6. At the time, we had to move quickly to make the DVD available for the class – after testing the disc on a couple of legacy OSX laptops from the Old Media Lab, we decided to temporarily keep an old PowerPC iBook running OSX 10.5 in the department’s Study Center lounge, where students from the “Interactive Cinema” course could book time to view “Sally”. This overall worked fine, although there was some amount of lag (some futzy and not-great sounds coming from the laptop’s internal disc drive made me prefer to run the disc off of a USB external drive – better for the disc’s physical safety, worse for its data rate), and the disc’s speech recognition components were not responsive, likely an issue with the laptop’s sound card.

Fast-forward to August 2017. Submitting her screening list for the semester, Marina let us know that not only would she be needing students to have access to “Sally or the Bubble Burst” again, but she was also expanding the course syllabus to include a number of similar interactive software-based works (by which, I’ll define, I mean CD- or DVD-ROMs with moving image material that require specific computer hardware or software components; not just an interactive DVD that will still play back in any common DVD player, which Marina also includes in her course but provide much less of a technical challenge). With more time to plan, I was interested in both more extended testing, to make sure “Sally” and all these works ran more as intended; and to have a discussion with Marina, Cathy, and Ina so we could strategize longer-term plans for access to these works. Quite simply, we are lucky that the department has (largely, I think, thanks to the presence of the MIAP program) over the years maintained a varied collection of legacy computers that could now run/test these works – we may not continue to be so lucky as the years wear on.

The alternative is pretty straightforward: migrate the content on these DVD-ROMs to file-based disk images, and run them through emulators or virtual machines on contemporary computer hardware rather than worn-down, glitchy, eventually-going-to-break legacy machines. But the questions with these kinds of access projects are always, A) has the content really been properly migrated/recreated, and B) does the experience of using the work on contemporary hardware acceptably recreate the experience of the work on its originally-intended hardware. The latter in particular was a question I could not answer on my own – without having seen, interacted with or studied these works in any detail, I did not consider myself in a position to judge whether emulated versions of these works were running as intended, in a manner acceptable for intense, classroom study. Marina and Ina, as scholars of interactive cinema and digital humanities, were in a better position to make an informed decision.

So, my initial goals were:

  • prepare a demo of emulated/virtualized works
  • match each interactive DVD with a legacy computer on which it ran best, for comparison’s sake, or, failing the emulation route, providing access to students

I set aside “Sally or the Bubble Burst”, as its processor/OS requirements put it squarely in the awkward PowerPC + Mac OSX zone that has proven difficult for emulation software and plagued my nightmares in the past. That left three discs to work with, listed here along with the technical requirements outlined in their documentation:

I wasn’t looking to perform bit-for-bit preservation/migration with this project. We still have the discs and their long-term shelf life will be a concern for another day – today I wanted acceptable emulation of the media contained on them. So by Occam’s Razor, I considered Mac’s Disk Utility app to be the quickest and best solution in this case to make disk images for demo and testing.

After selecting a disc in Disk Utility’s side menu, I browsed to Disk Utility’s File menu, selected “New Image” and then “Image from [name_of_disc]”.

Screen Shot 2017-09-29 at 1.43.11 PM.png

I selected the “CD/DVD master” option with no encryption, which, after a few minutes, creates a .cdr file. This was repeated three times, once for each disc.

Screen Shot 2017-09-29 at 1.44.01 PM.png

With a .cdr disk image ready for each work, now it was time to set up an emulated legacy OS environment to test them in. I decided to start with Mac OS 9 – an environment I was already familiar with and which matched at least the OS requirements of all three works.

For emulating Mac OS 8.0 through 9.0.4, I’ve had a lot of success with a program called SheepShaver. Going through all the steps to set up SheepShaver is its own walk-through – so I’m not even going to attempt to recreate it here, and instead just direct you to the thorough guide on the Emaculation forums, which is what I use anyway. (the only question generally is, where to get installation discs or disk images for legacy operating systems – we have a number still floating around the department, but I also have WinWorld bookmarked for all my abandoned software needs).

Once I got a working Mac OS 9 computer running in SheepShaver, I could go into SheepShaver’s preferences and mount the disk images I made earlier of “Immemory”, “Bleeding Through” and “Artintact” as Volumes, so that on rebooting SheepShaver, these discs will now appear on the emulated desktop, just as if we had inserted the original physical discs into an OS 9 desktop.

Screen Shot 2017-09-29 at 12.52.23 PM.png

First off I tried “Immemory”, the oldest work which also only required QuickTime v. 4.0 – which is the default version that comes packaged with Mac OS 9. I couldn’t be sure it was running exactly as planned, but the sound and moving images on the menu played smoothly, and I could navigate through the program with ease (well, relative ease – spotting the location of your cursor is often difficult in “Immemory”, but from reading through the instructions that seemed likely to be part of the point).

Screen Shot 2017-09-29 at 12.43.54 PM.png

Screen Shot 2017-09-29 at 12.44.29 PM.png

The next challenge was that “Bleeding Through” and “Artintact” required higher versions of QuickTime than 4.0. How do you update an obsolete piece of software in a virtual machine? First, scour the googles and the duckduckgos some more until you find another site offering abandonware (WinWorld, unfortunately, only offered up the QuickTime 4.0 installer). Yes, you need to be careful about this – plenty of trolls and far more malevolent actors are out there offering “useful” downloads that turn out to be malware. Generally I’m going to be a little more trusting of a site offering me QuickTime 5.0 than one offering QuickTime X – ancient software that only runs on obsolete or emulated equipment isn’t exactly a very tempting lure, if you’re out phishing. But, still something to watch out for. Intriguingly, I found a site called OldApps.com, similar to WinWorld, in that it has a stable, robust interface, a very active community board, and at least offers checksum information for (semi-)secure downloads. Lo and behold, a (I’m pretty sure) safe QuickTime 6.0.4 installer!

Screen Shot 2017-09-29 at 12.59.34 PM.png

With that program downloaded, now I had to get that into the virtual Mac OS 9 environment. Luckily, SheepShaver offers up some simple instructions for creating a “Shared” folder to shuttle files back and forth between your emulated desktop and your real one.

Screen Shot 2017-09-29 at 12.56.19 PM

Screen Shot 2017-09-29 at 12.57.52 PM.png

With the QuickTime 6 installer moved into my virtual environment, I could run it and ta-da: now the SheepShaver VM has QuickTime 6 in Mac OS 9!

Screen Shot 2017-09-29 at 12.53.51 PM.png

This is the point where I admit – everything had gone so swimmingly that I got a bit cocky. With the tech requirements fulfilled and the OS 9 environment set up, I went into the demo session with Marina, Ina and Cathy without having fully tested all three discs myself beforehand on the hardware it was going to run. And the results were…unideal. The color scheme on the menu for “The Complete Artintact”, supposed to be rendered in bright primary colors, was clearly off:

Screen Shot 2017-09-29 at 12.46.29 PM.png

Audio on “Bleeding Ground” played correctly, but there was no video, and the resolution on the menus was all off and difficult to control:

Screen Shot 2017-09-29 at 12.48.58 PM.png

Screen Shot 2017-09-29 at 12.49.27 PM.png

And even “Immemory”, which had run so smoothly at the start, now had clear interruptions in the audio, broken videos, and transitions between slides/pages were clunky and stuttered.

Though Marina came away impressed with the virtual OS 9 environment and the general idea of using emulators rather than the original media to provide access, the specific results were clearly not acceptable for scrutinous class use. Running some more tests and troubleshooting, I came to two conclusions: first, the iMac we were trying to install SheepShaver on in the Study Center was several years old, and probably not funneling enough processing power to the emulated computer to run everything smoothly. But also, I suspect that the OS 9 virtual machine was missing some system components or plugins for the later works (“The Complete Artintact” and “Bleeding Through”), and that the competing requirements (different versions of QuickTime in particular) was causing confusion when crammed together in one virtual environment – in other words, using QuickTime 6 was actually *too advanced* to run “Immemory”, designed for QuickTime 4.

So, solutions:

  • keep “Immemory” isolated in its own SheepShaver/OS 9 virtual machine with QuickTime 4
  • test “The Complete Artintact” and “Bleeding Through” in a virtual Windows machine, for comparison against different default OS components
  • install everything on a brand-new, more souped-up iMac

Success! Kept alone in its own virtual Mac OS 9 machine with QuickTime 4, “Immemory” went back to running smoothly. Using a different piece of emulation/virtualization software called VirtualBox (maintained by Oracle, and designed primarily to run Windows and Linux VMs), and going back to WinWorld and OldApps for legacy installers, I created a Windows 2000 virtual machine running QuickTime 6 for Windows for “The Complete Artintact” and “Bleeding Through” (settings in screenshot):

Screen Shot 2017-09-29 at 12.16.08 PM.png

Installed on new, powerful hardware (2016 iMac running macOS 10.12) that could correctly/quickly funnel plenty of CPU power and RAM to virtual machines, the works now looked “right” to me, and a second demo with Marina and Ina confirmed:

Screen Shot 2017-09-29 at 12.37.41 PM.pngScreen Shot 2017-09-29 at 12.39.34 PM.png

Screen Shot 2017-09-29 at 12.37.57 PM.png

Screen Shot 2017-09-29 at 12.38.35 PM

(The one last hitch: in “The Complete Artintact”, which is really an anthology collection of a number of interactive software works, some of the pieces had glitchy audio. Luckily, this was solved using VirtualBox’s sound settings, switching to a different virtualized audio controller, from “ICH AC97” to “Soundblaster 16”):

Screen Shot 2017-09-29 at 12.19.32 PM.png

There were a few more setup steps to make accessing the works easier to the students: creating desktop shortcuts for the virtual machines on the iMac desktop AND the disk images inside the virtual machines (so that students could just click through straight to the work, rather than navigating file systems on older, unfamiliar operating systems); adding an extra virtual optical drive to the Windows 2000 VM so that the VM could be booted up with both “Artintact” and “Bleeding Through” loaded at the same time; and creating a set of instructions and tips for the students to follow regarding navigating these emulators and legacy operating systems (for troubleshooting purposes).

Screen Shot 2017-09-29 at 12.36.35 PM.png

That left legacy testing for backup, as well as the question of “Sally and the Bubble Burst”. At this point time was running short, and emulating “Sally” seemed likely to be a more difficult and prolonged process. Luckily, we had an iMac running Mac OSX 10.6 (Snow Leopard), which includes Mac’s Rosetta software installed for running PowerPC applications (like “Sally”) on Intel machines. A disk image of Toni Dove’s work runs smoothly on that machine, including speech recognition input via the iMac’s built-in mic.

I did also run “Immemory”, “Bleeding Through” and “The Complete Artintact” on a Apple G4 desktop running OS 9 and QuickTime 6 – for whatever reason, running this combination of discs and software on the original hardware, as opposed to in the SheepShaver VM, did work acceptably. Though at this point, we accepted the emulation solution for class access, if at any point anything goes wrong, we can move that G4 from the Old Media Lab to the Study Center and run all the discs (or disk images) on that legacy machine, rather than the squiffy laptop solution that we used for “Sally or the Bubble Burst” a year ago.

So, there is the saga of “Interactive Cinema”. Aside from all this is concern that the disk images I made for this process don’t really constitute bit-for-bit preservation, and though Marina thought they were all running as intended, these are incredibly broad works and exploring and testing every detail manually was basically impossible. Ultimately, we may want to create forensic disk images off of the CD- and DVD-ROMs to ensure that we’re really capturing all the data and can ensure access to them in the future. But for now….it’s time for me to take a break!