Overview of Distributed File Systems
------------------------------------

============================================================================

We've talked about:
  * Decomposed operating systems
  * Client-server model
  * Remote procedure calls (and IPC in general)

Now, let's talk about organizing an important operating system application, file service, into a distributed client-server model.

Hardware trends:
  * Large, cheap disks (10Gbytes for $3k)
  * Networks of workstations/PCs
  * Cheap distributed processing power
      - A chicken in every pot, a workstation on every desk!
  * Server-class machines cost big bucks (marketing decision)

=============================================================================

GOALS:
  * Transparency
  * Convenient naming scheme (à la uniprocessor file server)
  * Convenient sharing semantics
  * High performance (caching)
  * Ease of administration (or at least not administrative nightmares)
  * Fault tolerance (or at least not fault intolerance)
  * Reasonable cost

=============================================================================

TRANSPARENCY: we'd like to make it easy to plug the distributed file system into what already works, and have it work too. Users should not be required to know how the distributed system is organized.

The key issues boil down to *naming* and *location*. Other issues: fault tolerance, semantics, performance.

Common approaches:
  * Machine + path naming (e.g., /n/porter/...)
  * Mounting remote file systems into local hierarchy (e.g., /bin/...)
      - grafting remote mount points into local file hierarchy
  * Complete location independence (single namespace, e.g., Mango Medley)
      - often done using dynamic mounting, hints+caching, or structured file ids (the entire structure moves as a unit, e.g., Andrew)

Location transparency: the name gives no hint as to the file's location (e.g., /n/porter/u/retrac says it's on porter, but not where porter is located on the network, and the file is accessible from any location).

Location independence: files can be moved between servers with no name change (/n/porter fails this test).
  * Greatly improves load balancing, administration, portability, and scalability (e.g., Andrew).

Issue: Who expands (resolves) names? Clients or servers?

============================================================================

SEMANTICS (Sharing): This problem will become more apparent when we start talking about distributed shared memory, but a key issue is what semantics we provide for shared files. The main issue boils down to how we do CACHING.

Possible choices:
  * Unix semantics:
      - every read sees the effect of every write
      - writes to open files are immediately visible to all clients
      - can share file position pointers (e.g., shell programs)
    Issues:
      * Expensive to implement!
      * Rarely required in full generality
      * Generally requires centralized, stateful algorithms
      * Can work around it (?)
  * Session semantics:
      - clients get a snapshot of the file when they open it
      - local writes are immediately visible, but only to the writing client
      - after close, subsequent opens see the changes
      - no shared file pointers (e.g., shell script example)
      - breaks some programs, but common (e.g., Andrew)
  * Immutable files:
      - write-once; to modify, create a new file that replaces the old one
      - hard to use
  * Transaction oriented:
      - must be serializable
      - must use locks
      - must support rollback
      - pain in the ass to use
  * Real systems:
      - NFS (delayed write semantics)
      - Microsoft LanMan (oplocks)
      - Medley (oplocks plus DSM)

=========================================================================

PERFORMANCE: Translates almost directly into efficient caching, avoiding disk accesses, and avoiding network IPC.

As with any caching study, it is important to know how the data being cached is commonly used (Satya's study prior to Andrew/CODA):
  * Most files are small (under 10K)
  * Most data is stored in big files (the big ones are BIG)
  * Reading is much more common than writing
  * Sequential access dominates
  * Many temporary files
  * Files are only rarely actively shared
  * Distinct file classes, e.g.,
      - binaries (rarely change, basically read-only) -> replicate
      - compiler and other temporary files -> handle 100% locally
      - mailboxes and other private files -> basically never shared
      - regular files -> on-demand replication might help + sharing

Issues:
  * Whole-file caching versus block caching (old Andrew versus Sprite)
  * Where to cache? Clients (disk or memory) or server (memory)? The Sprite folks make a strong argument for client-side caching. File servers usually have tons of DRAM to do server caching. A combination seems in order:
      - client-side caching for unshared data, temp files
      - server-side caching for interprocess locality
  * If client, where?
      - library code (Boo! Little sharing or reuse)
      - client kernel (disk buffer cache)
      - separate client-side cache server (pin pages?)
  * If client, volatile or persistent?
      - most systems, only in volatile DRAM
      - in Medley, both volatile and persistent
  * Cache consistency model
      - disable caching
          * imposes serious load on server
          * hope sharing is RARE (Sprite)
      - write through
          * lots of messages
          * need to revalidate at each subsequent open
          * does not help writes
          * does not help common case of temporary files
      - write back on close
          * session semantics
          * set timer, flush dirty blocks every N seconds
          * delay write back (for temporary files) (NFS+NT)
  * Who guarantees validity of cache contents (client or server)?
      - check on each open
      - callbacks (requires stateful server), e.g., LanMan oplocks
      - DSM-like protocols
  * How do you deal with possible inconsistencies?
      - Can I use the cached data if the remote server is down?

==========================================================================

FAULT TOLERANCE: For most file systems, this comes down to REPLICATION.

Issues:
  * Consistency management:
      - voting algorithms (talk about reader/writer quorums and ghosts)
      - leader/cohort model
      - atomic commit protocols
  * Stateful (Andrew and Sprite) versus stateless (NFS) servers

        STATELESS                       STATEFUL
        ---------                       --------
        Better fault tolerance!         Shorter request messages
        No OPEN/CLOSE calls             Cache state information
        No server-side tables           Prefetching
        Client crashes? No problem.     File locking easier

============================================================================

SCALABILITY and PERFORMANCE design issues:
  * Offload work from server(s)
  * Avoid centralized algorithms and bottlenecks
  * Consider using clusters
  * Replication (at least read-only!)
  * Multiple lightweight processes in server
  * Log-structured file systems

============================================================================

FUTURE TRENDS:
  * Portables (CODA)
  * Wide area file systems (e.g., WebNFS and CIFS) as a replacement for HTTP for wide-area sharing
  * Faster networks and processors, but "slower" disks and memory
  * Specialized servers: object-oriented/database file servers
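To make the difference between Unix semantics and session semantics (see SEMANTICS above) concrete, here is a toy sketch in Python. It assumes a single in-memory server and models each open file as a private snapshot taken at open time; the class and method names are hypothetical, and real session-semantics systems such as Andrew differ in many details (whole-file transfer, disk caching, callbacks).

```python
# Toy, in-memory model of session semantics. Server, Session, and their
# methods are hypothetical names for illustration, not any real system's API.

class Server:
    """Holds the authoritative copy of every file."""

    def __init__(self):
        self.files = {}

    def fetch(self, name):
        return self.files.get(name, "")

    def store(self, name, data):
        self.files[name] = data


class Session:
    """One client's open file: a private snapshot taken at open time."""

    def __init__(self, server, name):
        self.server = server
        self.name = name
        self.data = server.fetch(name)  # snapshot when the file is opened
        self.dirty = False

    def read(self):
        # Sees this session's own writes, never other clients' writes.
        return self.data

    def write(self, data):
        self.data = data  # visible immediately, but only to this session
        self.dirty = True

    def close(self):
        if self.dirty:  # changes are published to the server only on close
            self.server.store(self.name, self.data)
            self.dirty = False


server = Server()
server.store("notes", "v1")

a = Session(server, "notes")
b = Session(server, "notes")
a.write("v2 from A")
print(a.read())  # v2 from A
print(b.read())  # v1 (B still holds the snapshot it took at open)
a.close()
print(b.read())  # v1 (B opened before A closed)
c = Session(server, "notes")
print(c.read())  # v2 from A (a later open sees A's changes)
```

Under Unix semantics, B's second read would already return A's data. In this sketch, if B also wrote and closed after A, B's snapshot would silently replace A's version (the last close wins).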
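The write-back options under PERFORMANCE (flush dirty blocks on a timer, delay write-back so temporary files die locally) can be illustrated with a minimal client cache. This is a sketch under stated assumptions: the 30-second DELAY, the dict of blocks standing in for the server, and all class and method names are hypothetical, not the actual NFS or NT mechanism.

```python
# Hypothetical delayed-write-back client cache. DELAY and all names are
# illustrative assumptions; the "server" is just a dict of blocks.

DELAY = 30  # seconds a dirty block may stay in the client cache (assumed)


class WriteBackCache:
    def __init__(self, server):
        self.server = server  # dict: (file, block_no) -> bytes
        self.dirty = {}       # (file, block_no) -> (data, time first dirtied)
        self.messages = 0     # block write-backs sent to the server

    def write(self, file, block_no, data, now):
        key = (file, block_no)
        # Rewriting a dirty block keeps its original dirty time.
        first = self.dirty.get(key, (None, now))[1]
        self.dirty[key] = (data, first)

    def delete(self, file):
        # Dirty blocks of a deleted (e.g., temporary) file are simply dropped,
        # so they are never written back. Removing already-flushed blocks at
        # the server is omitted from this sketch.
        for key in [k for k in self.dirty if k[0] == file]:
            del self.dirty[key]

    def flush(self, now):
        # Called periodically (e.g., by a timer): write back old dirty blocks.
        for key, (data, since) in list(self.dirty.items()):
            if now - since >= DELAY:
                self.server[key] = data
                self.messages += 1
                del self.dirty[key]


server = {}
cache = WriteBackCache(server)
cache.write("report.txt", 0, b"final", now=0)
cache.write("cc_tmp.o", 0, b"scratch", now=0)
cache.delete("cc_tmp.o")  # temporary file removed before any flush
cache.flush(now=10)       # nothing is old enough yet
print(cache.messages)     # 0
cache.flush(now=30)
print(server)             # {('report.txt', 0): b'final'}
print(cache.messages)     # 1
```

A write-through cache would have sent both blocks to the server (plus one message per repeated write); here the temporary file costs no server traffic at all. The price is that dirty blocks still sitting in the cache are lost if the client crashes.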
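The "callbacks (requires stateful server)" option for guaranteeing cache validity can be sketched as follows. It is loosely in the spirit of Andrew callbacks or LanMan oplocks but is not either protocol: the server remembers which clients cache each file and sends a hypothetical invalidate message when the file changes, so a client holding a callback can reuse its cached copy on open without asking the server. All names are assumptions for illustration.

```python
# Toy callback-based cache validation. CallbackServer, Client, and the
# invalidate message are hypothetical; messages counts server round trips.

class CallbackServer:
    def __init__(self):
        self.files = {}      # name -> contents
        self.callbacks = {}  # name -> set of clients promised a notification
        self.messages = 0

    def fetch(self, client, name):
        self.messages += 1
        # Server state: remember that this client now caches the file.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files.get(name, "")

    def store(self, writer, name, data):
        self.messages += 1
        self.files[name] = data
        for client in self.callbacks.get(name, set()) - {writer}:
            self.messages += 1  # one "break callback" message per holder
            client.invalidate(name)
        self.callbacks[name] = {writer}  # only the writer's copy is current


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open_read(self, name):
        if name not in self.cache:  # no valid copy: must fetch
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]     # callback held: no server contact

    def write(self, name, data):
        self.cache[name] = data
        self.server.store(self, name, data)

    def invalidate(self, name):
        self.cache.pop(name, None)


server = CallbackServer()
server.files["motd"] = "hello"
a, b = Client(server), Client(server)
for _ in range(3):
    a.open_read("motd")         # only the first open contacts the server
b.write("motd", "changed")      # server breaks A's callback
print(a.open_read("motd"))      # changed
print(server.messages)          # 4
```

A client that checks on each open would contact the server on all four of A's opens. The per-file callback table is exactly the server-side state that a stateless server such as NFS avoids, and it must be rebuilt or revoked if the server crashes.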
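The reader/writer quorum idea listed under FAULT TOLERANCE can be sketched with version numbers. The sketch assumes N equally weighted replicas, sequential (non-concurrent) clients, and that the first live replicas form each quorum; ghosts and weighted votes are not modelled, and all names are hypothetical.

```python
# Toy read/write quorum voting over N replicas. Replica, write, read, and
# valid_quorums are hypothetical names; each replica has one vote (assumed).

N = 5  # number of replicas (assumed for illustration)


class Replica:
    def __init__(self):
        self.version = 0
        self.data = None
        self.up = True


def valid_quorums(n, r, w):
    # Every read quorum overlaps every write quorum (r + w > n), and any two
    # write quorums overlap (2w > n), so each quorum contains the latest version.
    return r + w > n and 2 * w > n


def write(replicas, w, data):
    live = [rep for rep in replicas if rep.up]
    if len(live) < w:
        raise RuntimeError("write quorum unavailable")
    quorum = live[:w]
    # Overlap with the previous write quorum makes this max the current version.
    new_version = max(rep.version for rep in quorum) + 1
    for rep in quorum:
        rep.version, rep.data = new_version, data


def read(replicas, r):
    live = [rep for rep in replicas if rep.up]
    if len(live) < r:
        raise RuntimeError("read quorum unavailable")
    # Return the copy with the highest version number in the read quorum.
    return max(live[:r], key=lambda rep: rep.version).data


replicas = [Replica() for _ in range(N)]
R, W = 2, 4
assert valid_quorums(N, R, W)

write(replicas, W, "v1")  # replicas 0-3 now hold version 1; replica 4 is stale
for rep in replicas[:3]:
    rep.up = False        # three replicas crash
print(read(replicas, R))  # v1: quorum {3, 4} includes replica 3's version 1
try:
    write(replicas, W, "v2")
except RuntimeError as err:
    print(err)            # write quorum unavailable
```

With R = 2 and W = 4, reads survive three crashes while writes need four live replicas. Choosing R = 4 and W = 2 instead would violate 2W > N, letting two disjoint write quorums assign the same version number to different data.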