Capstone

Introduction

My capstone project is called "Evaluating Emulation Workflows for Historical Software and Data." Dr. Eric Kaltman was my faculty mentor for this project.

Background

As the number of legacy files, programs, and other born-digital materials grows, libraries and archives increasingly need ways to interpret this data. Emulation is a recovery method that allows a modern computer to run historical software. It is commonly used in data archiving, but much of the scholarly work focuses on a higher-level view of emulation applied in the field (Paper about a high-level content profile of a 17 TB dataset from Carnegie Mellon). Working with emulated systems involves many technical nuances that can be a barrier to entry even for technical users. We hope that this work shows commonalities in emulation workflows to help non-technical users, like archivists and librarians, better understand the granular steps in emulation.

Methods

This research started with instructing three undergraduate computer science research associates to record their attempts to transfer and emulate software data from local materials acquired from the CSUCI Library and Carnegie Mellon’s Entertainment Technology Center. During the study, the research associates analyzed the content, determined the requirements to run the data, emulated 18 operating systems, 6 library disks, and 6 projects, and cataloged 393 historical software instances. (example of emulating disk) They recorded their findings in an adapted diary study format. The research associates used emulators including EaaSI, VirtualBox, SheepShaver, and Basilisk II, and recorded roughly 700 pages of process notes.

Next, we analyzed these notes using grounded theory principles. We loaded the notes into Atlas.ti, a standard qualitative data analysis tool. We developed 1,447 codes with 3,488 quotations and upwards of 41 memos. These codes were then grouped by similar concepts and by position in the tech stack to extract the core components of an emulation workflow.
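The grouping step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Atlas.ti workflow we used: the code names, concepts, layers, and quotation counts below are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict

# Hypothetical rows: (code name, concept, stack layer, quotation count).
# These are illustrative placeholders, not codes or counts from the study.
CODES = [
    ("shared folder setup", "file transfer", "emulator", 3),
    ("drag and drop file", "file transfer", "emulator", 1),
    ("mount disk image", "file transfer", "host OS", 2),
    ("install dependency", "configuration", "guest OS", 4),
]


def group_codes(codes):
    """Group code names and sum quotation counts by (concept, stack layer)."""
    groups = defaultdict(lambda: {"codes": [], "quotations": 0})
    for name, concept, layer, quotations in codes:
        key = (concept, layer)
        groups[key]["codes"].append(name)
        groups[key]["quotations"] += quotations
    return dict(groups)


grouped = group_codes(CODES)
for (concept, layer), info in sorted(grouped.items()):
    print(f"{concept} / {layer}: {len(info['codes'])} codes, "
          f"{info['quotations']} quotations")
```

Keying each group on both the concept and the stack layer mirrors the two grouping criteria named above, so a single concept such as file transfer can still be split by where in the stack it happens.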

Emulation Stack

To successfully emulate historical software and data, several layers of software need to be used so the data is run according to its historical context. At the base layer, a host operating system is needed. Depending on the type of emulator needed, the host will need to run Windows or Linux. An emulator program then reads the historical executable data associated with an older program and imitates the program’s original operating system and computing environment. In our work, we made use of multiple local emulators and virtual machine managers. The SheepShaver and Basilisk II emulators were used for Mac OS operating systems, while the VirtualBox virtual machine hypervisor was used for Windows-based programs. We also made use of CSUCI’s access to the Emulation as a Service Infrastructure (EaaSI) project operated by Yale University Library. The EaaSI project provides cloud-based emulations that are shared between libraries. Once an emulated system was running, the team needed to locate and install the relevant historical dependencies for each target project and data file.
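As a rough illustration of the pairing described above, the choice of local emulator can be written as a small lookup. The pairings come from this section; the lookup keys, the function name, and the matching rule are assumptions made for the sketch, and EaaSI is left out because it is a hosted service rather than a local tool.

```python
# Emulator pairings as described in this section.
# The lowercase keys are hypothetical labels chosen for this sketch.
EMULATORS_BY_GUEST = {
    "mac os": ["SheepShaver", "Basilisk II"],
    "windows": ["VirtualBox"],
}


def candidate_emulators(guest_os: str) -> list[str]:
    """Return the local emulators this project used for a guest OS family.

    Matching is a simple prefix check on the lowercased name, which is an
    assumption of this sketch. An empty list means no pairing is recorded.
    """
    name = guest_os.strip().lower()
    for family, emulators in EMULATORS_BY_GUEST.items():
        if name.startswith(family):
            return list(emulators)
    return []


print(candidate_emulators("Mac OS 8"))
print(candidate_emulators("Windows 95"))
```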

This is an example of the tech stack we used during emulation work.

Note that historical data will likely need to be imaged on the host operating system, through a program like WinCDEmu or a BitCurator Linux VM, before it can be loaded into a guest operating system. Sometimes data can instead be passed to the emulator software using some kind of bidirectional sharing. As a result, the historical data usually runs from an .iso file.
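The layers described in this section can also be written down as an ordered list, which may help when reading the process notes layer by layer. This is a minimal sketch: the class name, field names, and example strings are our own labels for illustration, and the examples show one possible configuration rather than the only one we used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One layer of the emulation stack (field names are illustrative)."""
    role: str
    example: str


# Ordered from the bottom of the stack to the top.
STACK = [
    Layer("host operating system", "Windows or Linux"),
    Layer("emulator or hypervisor", "VirtualBox"),
    Layer("guest operating system", "a historical Windows release"),
    Layer("historical dependencies", "software the target project requires"),
    Layer("historical data", "a disk image such as an .iso file"),
]


def describe(stack):
    """Return a numbered, bottom-to-top description of the stack."""
    return "\n".join(
        f"{number}. {layer.role}: {layer.example}"
        for number, layer in enumerate(stack, start=1)
    )


print(describe(STACK))
```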

Conclusions

We have a paper submitted for iPRES 2023. The project is still a work in progress: we need to finish the grounded theory work and then articulate the results from the processes identified. I can broadly draw some conclusions from the codes that were grouped. For example, I grouped the methods students used to share data between the host and guest systems. Just this simple topic had 14 different codes with 158 quotations associated with it. We have so many quotations because of the vastly different processes needed to transfer data on SheepShaver, Basilisk II, VirtualBox, and EaaSI. These categories can be broken down further, because certain methods allow bidirectional file sharing while others allow transfer in only one direction. The other code groups I derived focus on the location in the emulation stack and the processes being done. The next steps are to break down each part of the emulation stack and associate the processes with each technological layer.

Presentation Video

Blog Posts: