You must form a group of three CS6963 students to collaborate on the project.
You'll turn in your code and a short write-up
describing the design and implementation of your project, and make
a short in-class presentation about your work. We will post your write-up
and code on the web site after the end of the semester, unless you
explicitly talk to us about why you want to keep yours confidential.
Your project should be something interesting and challenging that's
closely related to CS6963 core topics, such as fault tolerance. Below
you'll find some half-baked ideas that we think could turn into
interesting projects, but we haven't given them too much thought.
Deliverables
There are four concrete steps to the final project, as follows:
Form a group and decide on the project you would like to work on.
Feel free to use Canvas
to find group members and discuss ideas.
Course staff will be happy to discuss project ideas via e-mail or in
person.
Flesh out the exact problem you
will be addressing and how you will go about solving it.
By the proposal deadline, you must
submit a proposal (less than a page) describing: your group members,
the problem you want to address, how you plan to address it,
and what specifically you are proposing to design and implement.
Submit your proposal to both the TA and the instructor via email.
We'll tell you whether we approve and give you feedback.
The projects can take almost any form. Here are some high-level templates;
below are more specific ideas:
Use your research area: several students work in labs on
distributed systems projects or projects adjacent to distributed systems. If
at all possible, leverage that to try to find a new question related to the
work you already do. Specifying distributed systems with domain-specific
languages, modeling them, and visualizing them are all related to Ganesh's DS2
project. Tackling a small concrete question or producing a related demo is
perfect.
Extend/improve/measure existing systems: Runway would be a great project to
contribute to. The core infrastructure could be improved, but even just
providing additional models would be great.
Extend the labs: implement Lab 3b, Lab 4, and Lab 5 (true persistence) and
run your Raft KVS on a real network on Emulab. Find one unique question or
enhancement to assess. Profile the performance and/or find pathologies
(starvation due to leader election, asymmetric geographic placement with
unfortunate leader placement, ePaxos-like enhancements, assessing
costs/tradeoffs of many/few Raft groups, etc.).
A literature review: find and review 3 to 5 papers related to a specific
topic/theme (MR/Spark, replication, load balancing, consistency, distributed
transactions, etc.) from the most recent top conferences (SOSP, NSDI,
VLDB, SIGMOD; '15, '16). Such a review should include comparisons of common
approaches/themes or infer a trajectory for that area of research. (For example, read
RAMCloud SOSP'15, FaRM NSDI'14 and SOSP'15, and End of Slow Networks VLDB'16, then
closely compare the performance, data model, fault tolerance, and cost tradeoffs
of the different transaction approaches.)
Execute your project: design and build something neat!
Write a document describing the design and implementation of your project,
and turn it in along with your project's code by the final deadline. The
document should be about 3 pages of text that helps us understand what
problem you solved, and what your code does. The code and write-ups will
be posted online after the end of the semester.
Prepare a short in-class presentation about the work that you have
done for your final project. We will provide a projector that you can use to
demonstrate your project. Depending on the number of project groups,
we may have to limit the total number of presentations, so some groups
might not end up presenting.
Half-baked project ideas
Here's a list of ideas to get you started thinking -- but you should
feel free to propose your own ideas.
Instrument your Raft implementation and visualize it with ShiViz.
Model two-phase commit in Runway. See
if it can be used to find/debug blocking under certain node failure patterns.
Design a strategy for scaling up a memcached cluster (that uses consistent
hashing, for example). Measure the impact on cache hit rates and performance when
the configuration is changed.
Simulate a protocol similar to Lab 4 (a partitioned Raft-based KVS) and
compare its tail latency to a Dynamo-based approach (with N=3, R=2, W=2).
Port a simple web application (but more interesting than a shopping cart)
from a conventional database to using only
CRDTs,
and try running it across two sites separated by a wide geographic area.
Understand the memory fragmentation issues of modern DSMs and design a
solution.
Port a service to a Unikernel; compare the request latency distribution to
running on Linux and characterize the differences you see (see Leverich et al.).
Look at the dispatch overhead of a modern request-response based service
and design some form of lightweight event dispatch to reduce overheads.
Specify a simple system in TLA or Coq. State key invariants and prove
correctness.
Develop a system that transmits responses directly from a data structure
without synchronizing with writers, but uses client-side logic to patch up
inconsistencies.
Simulate a transaction protocol from class, like Thor, under more modern
network assumptions and suggest improvements.
Build a distributed, decentralized, fault-tolerant Reddit.
Make the state synchronization protocol (DDP) in Meteor more efficient (e.g., send
fewer bytes between server and client) and more fault-tolerant (e.g.,
a client should be able to tolerate server failures, as long as enough
servers remain live).
Build a fault-tolerant file service; on the client side, you could
use FUSE to run your own client code, or you could have clients talk
NFS to your server, as in Harp.
Build a better fault-tolerant peer-to-peer tracker for BitTorrent.
Build a system for making Node.js applications fault-tolerant,
perhaps using some form of replicated execution.
Add cross-shard atomic transactions to Lab 4, using two-phase commit
and/or snapshots.
Build a system with asynchronous replication (like Dynamo or
Ficus or Bayou). Perhaps add stronger consistency (as in COPS
or Walter or Lynx).
Build a
distributed shared memory (DSM) system, so that you can run
multi-threaded shared memory parallel programs on a cluster of
machines, using paging to give the appearance of real shared memory.
When a thread tries to access a page that's on another machine, the
page fault will give the DSM system a chance to fetch the page over
the network from whatever machine currently stores it.
Build a distributed RAID in the style of FAB.
Maybe you can get standard operating systems
to talk to your network virtual disk using iSCSI or
Linux's NBD (network block device).
Build a coherent caching system for use by web sites (a bit
like memcached), perhaps along the lines of
TxCache.
Build a distributed cooperative web cache, perhaps along
the lines of
Firecoral or
Maygh.
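
As a starting point for the memcached scaling idea above, here is a minimal sketch of the kind of measurement it asks for: how many keys change servers when a cluster grows from four to five nodes. Everything in it is an assumption made for illustration, not part of any lab or of memcached itself: the HashRing class, the server names, MD5 as the ring hash, and 100 virtual nodes per server. Keys that move are a rough proxy for cache misses right after the configuration change.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary choice)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Place `vnodes` points for this server on the ring.
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


def moved_fraction(keys, before, after):
    """Fraction of keys whose owning server differs between two mappings."""
    moved = sum(1 for k in keys if before(k) != after(k))
    return moved / len(keys)


# Hypothetical workload and cluster configurations.
keys = [f"key-{i}" for i in range(10_000)]
old_servers = ["cache-a", "cache-b", "cache-c", "cache-d"]
new_servers = old_servers + ["cache-e"]

ring_old = HashRing(old_servers)
ring_new = HashRing(new_servers)
ring_moved = moved_fraction(keys, ring_old.lookup, ring_new.lookup)

# Baseline: naive modulo placement over the server list.
mod_moved = moved_fraction(
    keys,
    lambda k: old_servers[_hash(k) % len(old_servers)],
    lambda k: new_servers[_hash(k) % len(new_servers)],
)

print(f"consistent hashing: {ring_moved:.1%} of keys moved")
print(f"modulo placement:   {mod_moved:.1%} of keys moved")
```

With consistent hashing, only keys that land on the new server should move, so the moved fraction should be much smaller than with modulo placement. A real project would replace the synthetic keys with a realistic workload and measure hit rates and latency on an actual cluster.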