The basic goal of 18-847z's project component is for class members to explore, in a hands-on manner, some concrete question in the space of cloud and data-intensive computing and storage. Each project should explore issues, solve problems, or exploit techniques from classroom discussions or papers.

You are encouraged to propose your own project idea, and we will provide a few project topic ideas (to help you brainstorm). It is more than fine for your project to serve some external purpose (e.g., contributing to your research agenda), but there must be a concrete project completed and reported on during the semester.

Logistically, you should work in groups of up to three, though you may work alone if you prefer. For experiments, use your research group's machines or an available cluster/cloud (e.g., Emulab, Open Cirrus, OpenCloud, or possibly CMU's vCloud later in the semester).

Although you are all experienced with reading research papers, a good resource to look at is "An Evaluation of the Ninth SOSP Submissions, or How (and How Not) to Write a Good Systems Paper" by R. Levin and D. Redell, Operating Systems Review, vol. 17, no. 3, July 1983, pp. 35-40 -- it is available here.

Some example project ideas

  • port a data-intensive application to a cloud and evaluate the experience, including effort involved, performance consequences, etc.
  • a group of mostly-local researchers recently proposed an approach to elastically sizing the set of nodes that serve a distributed file system for data-intensive applications. It is described as a power-scaling technique, but it also works for changing the set of nodes in a shared cloud, so that other applications can use the freed resources uninterrupted. It would be interesting to explore the write-offloading approaches and/or how to integrate the technique most cleanly into a popular free software system (e.g., Hadoop's HDFS or KFS).
  • incremental execution of MapReduce. If a dataset is a series of time-specific logs/captures from some source, and a computation runs over the data of many time periods, how can MapReduce be modified/managed to reuse partial/intermediate results from prior runs each time a new log is added?
  • instrument a cloud used for multiple activities and collect/analyze data on resource utilization/demands.
  • byte-addressable NVRAM in some form (e.g., PCM or memristors) is coming. How could such "storage" be used in data-intensive computing?
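The incremental-MapReduce idea above can be sketched in miniature. The following is a minimal Python sketch, assuming word counting as the example computation; the function names and the per-log caching scheme are hypothetical illustrations, not part of any existing Hadoop/MapReduce API:

```python
from collections import Counter

def map_count(log_lines):
    """Map + combine step for one log: word counts for that log only."""
    counts = Counter()
    for line in log_lines:
        counts.update(line.split())
    return counts

def incremental_reduce(cached_partials, new_log):
    """Merge cached per-log partial results with the newly arrived log.

    Only the new log is mapped; prior logs are never reprocessed.
    Returns the combined totals and the updated cache.
    """
    new_partial = map_count(new_log)
    total = Counter()
    for partial in cached_partials:
        total.update(partial)
    total.update(new_partial)
    return total, cached_partials + [new_partial]

# Usage: two days of logs arrive one at a time.
day1 = ["error warn", "error"]
day2 = ["warn info"]
total, cache = incremental_reduce([], day1)
total, cache = incremental_reduce(cache, day2)
# total["error"] == 2, total["warn"] == 2, total["info"] == 1
```

The design point the sketch illustrates is that map output is deterministic per input partition, so per-log intermediate results can be cached and only the reduce (merge) step rerun when a new log arrives; a real project would have to handle this at the level of HDFS files and job scheduling rather than in-memory dictionaries.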
Project Deadlines


    Project Proposal

    Due October 1 (PDF emailed to instructor).

    (no more than 3 pages; single spaced, one or two columns, 10 point font or larger)

    Describe the project idea/goal, how it relates to the course material, what work must be done (suggesting how it can be partitioned among you), and what resources you will need (including software and hardware systems to which you already have access). Concentrate on convincing us that the project pertains to the course, that you will be able to complete it, and that we will be able to evaluate it. The third page should be dedicated to an outline of your intended final paper, identifying the specific experiments to be run and the questions they will answer.

    For examples, check out this old projects page from a different class.

    In-Class Presentation

    November 29-December 1.

    Everyone will describe their project and results to the rest of the class during the last week of classes. Exactly how much time each talk gets will be determined once we know how many projects there are.

    Final Report

    Due December 3 (automatic extension to December 10).

    (up to 10 pages)

    This is the final project report, written in "computer systems paper" style. It should report goals, relationship to the course, implementation design, evaluation methodology, results and analysis, discussion of hypothesis outcome, most interesting future work, and a bibliography.