Coda Server Internals
Peter J Braam

Contents
  • Data structure overview
  • Volumes
  • Vnodes
  • Inodes
Data Structure Overview
Each object, its purpose, and where it resides:
  • Inodes: file contents; stored in /vicep* partitions
  • Volumes, Vnodes, directory contents, ACLs, reslogs: metadata & directory contents; stored in RVM
  • Volinfo records: volume location; stored in the VLDB and VRDB (read-write db files)
  • VSG, pdb, and token records: security; stored in the VSGDB, .pdb, and .tk files (dynamic read-only db files)
  • Servers/SCM, partitions, startup flags, skipvolumes, LOG & DATA & DB locators: configuration data; stored in static data files

RVM layout (coda_globals.h)
  • Already_initialized (int)
  • struct VolHead[MAXVOLS]
  • struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
  • short SmallVnodeIndex
  • … the same fields again for large vnodes …
  • MaxVolId (unsigned long)
  • The remainder is dynamically allocated (see the sketch below)
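
A rough sketch of that recoverable segment, purely for orientation: the field names follow the slide, but the struct name, placeholder values, and placeholder bodies are assumptions, not the real coda_globals.h definitions.

    /* Illustrative sketch only; the authoritative layout is in coda_globals.h. */
    #define MAXVOLS     1024   /* placeholder value, assumption */
    #define SM_FREESIZE 64     /* placeholder value, assumption */

    struct VolHead         { int placeholder; };   /* per-volume header, detailed later */
    struct VnodeDiskObject { int placeholder; };   /* on-RVM vnode representation       */

    struct coda_recoverable_segment {
        int already_initialized;                   /* has RVM been formatted yet?       */
        struct VolHead VolumeList[MAXVOLS];        /* one slot per volume               */
        struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE];
        short SmallVnodeIndex;                     /* top of the small-vnode free list  */
        /* ... the same free-list fields repeated for large vnodes ...                  */
        unsigned long MaxVolId;                    /* highest volume id handed out      */
        /* the rest of the segment is rvm_malloc'ed dynamically                         */
    };
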
Volume zoo (volume.h, camprivate.h)
  • RVM structures:
    • VolumeData
    • VolHead
    • VolumeHeader
    • VolumeDiskData
  • VM structures:
    • Volume
    • VolumeInfo, …
A volume in RVM
  • VolHead contains a VolumeHeader and a VolumeData
  • VolumeHeader: stamp, id, parentid, type
  • VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists, and the same fields for big vnodes
  • VolumeData contains pointers to rvm_malloc'ed data (see the sketch below)
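
The same nesting expressed as a sketch; the structure follows the slide, while the field types and exact member names are assumptions rather than the definitions in volume.h/camprivate.h.

    struct VolumeDiskData;    /* persistent per-volume data, rvm_malloc'ed (next slide) */
    struct rec_smolist;       /* recoverable list of vnodes (see "Vnodes in RVM")       */

    struct VolumeHeader {
        unsigned int stamp;           /* version/magic stamp             */
        unsigned int id;              /* volume id                       */
        unsigned int parentid;        /* id of the parent volume         */
        int          type;            /* read-write, read-only, backup   */
    };

    struct VolumeData {
        struct VolumeDiskData *volumeDiskData;   /* -> rvm_malloc'ed VolumeDiskData */
        struct rec_smolist    *smallVnodeLists;  /* -> rvm_malloc'ed array of lists */
        unsigned int nsmallVnodes;
        unsigned int nsmallLists;
        /* ... the same three fields again for big (large) vnodes ... */
    };

    struct VolHead {
        struct VolumeHeader header;
        struct VolumeData   data;
    };
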

VolumeDiskData (rvm)
  • Lots of state:
    • Identity & location: partition, name
    • Runtime info: use, inService, blessed, salvaged
    • Vnode related: next uniquifier
    • Version vector
    • Resolution flags, pointer to the recov_vol_log
    • Quota
    • Resource usage: filecount, diskused, etc.
Volumes in VM
  • struct Volume objects sit in the VolHash with copies of the RVM data structures
  • Volumes are salvaged before being “attached” to the VolHash
  • Model of operation (FS), sketched below:
    • GetVolume copies the volume out of RVM
    • modifications are made to the VM copy
    • PutVolume commits them in an RVM transaction
  • Model of operation (Volutil):
    • operate on RVM directly
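
A minimal sketch of that FS-side model. GetVolume and PutVolume are named on the slide, but the prototypes, the Error/VolumeId typedefs, and the example function below are simplified assumptions, not the real Coda signatures.

    typedef unsigned int VolumeId;                 /* placeholder typedefs           */
    typedef int          Error;
    struct Volume;                                 /* VM copy (see "Volume zoo")     */

    Volume *GetVolume(Error *err, VolumeId vid);   /* assumed prototype: copy volume
                                                      state out of RVM into VM       */
    void    PutVolume(Volume *vp);                 /* assumed prototype: commit the
                                                      changes in an RVM transaction  */

    void UpdateVolumeExample(VolumeId vid)
    {
        Error err = 0;
        Volume *vp = GetVolume(&err, vid);
        if (!vp || err)
            return;                                /* volume missing or not attached */

        /* ... modify the VM copy of the volume here ... */

        PutVolume(vp);                             /* RVM transaction happens inside */
    }
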
Volumes in Venus RPCs
  • One RPC: GetVolInfo
    • used for mount point traversal
  • Relates only to
    • the volume location database
    • the volume replication database
    • the VSGDB
  • Could sit in a separate volume location server
Vnodes (cvnode.h)
  • Small & large: large vnodes are for directories
    • the difference is the ACL at the back of large vnodes
  • Inode field:
    • small vnodes: holds the inode number of the container file on disk
    • large vnodes: holds the RVM address of the directory inode
  • Contain an important small structure: the version vector (vv_t)
  • Pointers to reslog entries
  • VM: cvnodes with a hash table, freelists, etc.
Vnodes in RVM
  • RVM: VnodeDiskObject (rvm_malloc'ed)
  • Vnodes sit on rec_smolists (sketched below)
    • each link points to a VnodeDiskObject
    • the lists link vnodes with identical vnode numbers but different uniquifiers
    • new vnodes are grabbed from the FreeLists (index.cc, recov{a,b,c}.cc)
    • volumes have arrays of rec_smolists which grow when they are full
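
A hedged sketch of what that lookup amounts to: a simplified stand-in for rec_smolist and a walk that matches on the uniquifier. The field names, the indexing rule, and FindVnode itself are illustrative assumptions.

    #include <cstdint>

    struct VnodeDiskObject {
        uint32_t uniquifier;           /* distinguishes reincarnations of a vnode number */
        VnodeDiskObject *next;         /* link on the rec_smolist (simplified)           */
        /* ... the remaining on-RVM vnode fields ... */
    };

    struct rec_smolist {               /* simplified: a singly linked list head in RVM   */
        VnodeDiskObject *head;
    };

    VnodeDiskObject *FindVnode(rec_smolist *lists, unsigned nlists,
                               uint32_t vnodenumber, uint32_t uniquifier)
    {
        rec_smolist *l = &lists[vnodenumber % nlists];      /* list for this vnode number */
        for (VnodeDiskObject *v = l->head; v; v = v->next)
            if (v->uniquifier == uniquifier)
                return v;                                   /* the right incarnation      */
        return 0;
    }
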
Vnodes in action
  • Model:
    • GetFSObj calls GetVnode
    • the work is done on the VM copy
    • PutObjects calls
      • rvm_begin_transaction
      • ReplaceVnode, which copies data from VM to RVM
      • rvm_end_transaction
  • Getting a vnode takes 3 pointer dereferences, and possibly 3 page faults, vs. 1 for local file systems.
  • Is this necessary? Probably not. Cure it: yes!
Directories (rvm)
  • DirInode
    • page table and a “copy on write” refcount
  • DirPages, 2048 bytes each (sketched below)
    • build up the directory
    • divided into 64 32-byte blobs
    • hash table for fast name lookups
    • blob freelist
    • array of free blobs per page
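
A rough sketch of those structures under the stated sizes (2048-byte pages, 64 blobs of 32 bytes). The real directory package also keeps per-page headers, the name hash table, and the free-blob maps inside the pages; those are only noted in comments here, and the member names are assumptions.

    #include <cstdint>

    enum {
        DIR_PAGESIZE       = 2048,
        DIR_BLOBSIZE       = 32,
        DIR_BLOBS_PER_PAGE = DIR_PAGESIZE / DIR_BLOBSIZE    /* = 64 */
    };

    struct DirPage {
        /* directory entries occupy one or more 32-byte blobs; each page also
           carries its free-blob map, and the directory keeps a hash table for
           fast name lookups (both omitted in this sketch) */
        uint8_t blobs[DIR_BLOBS_PER_PAGE][DIR_BLOBSIZE];
    };

    struct DirInode {
        int       refcount;     /* shared by vnodes; copy-on-write when > 1    */
        int       npages;
        DirPage **pages;        /* page table: rvm_malloc'ed array of pointers */
    };
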
Directories
  • More than one vnode can point to a directory (copy on write)
  • VM: a hash table of DirHandles, which
    • point to a contiguous VM copy of the directory
    • point to the DirInode
    • carry a lock, etc.
  • Model: as for volumes & vnodes
  • Critique: too baroque
Files
  • The vnode references the file by inode number
  • Files are copy on write
  • There are “FileInodes”, like dir inodes, but they are held in an external DB or in the inode itself
  • The server always reads/writes whole files (this could be exploited)
Volinit and salvage
  • Set up the volume hash table, serverlist, and DiskPartitionList
  • Cycle through the partitions; for each, check
    • the list of inodes
    • that every inode has a vnode
    • that every vnode has a directory name
    • that every directory name has a vnode
  • Put the volume in a VM hash table
Server connection info
  • Array of HostEntry structures (one per “venus”), sketched below; each
    • contains a linked list of connections
    • contains a callback connection id
  • Connection setup
    • the first binding creates a host entry & callback connection
    • each new binding creates a new connection and verifies the callback
    • done in RPC2_NewBinding & ViceNewConnectFS
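
A hedged sketch of that per-client state; the field names are assumptions chosen to mirror the bullets, not the server's actual host and connection tables.

    #include <cstdint>

    struct ConnEntry {
        uint32_t   connid;      /* RPC2 connection id for one client connection */
        ConnEntry *next;        /* linked list hanging off the HostEntry        */
    };

    struct HostEntry {
        uint32_t   hostaddr;    /* address of the Venus this entry represents   */
        uint32_t   cbconnid;    /* callback connection id back to that Venus    */
        ConnEntry *conns;       /* all connections from this client             */
    };

    /* The server keeps an array of these, one slot per known client, e.g.
       HostEntry hostTable[MAXHOSTS];  (MAXHOSTS is illustrative) */
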
Callbacks
  • Hash table of FileEntries (sketched below); each contains
    • the Fid
    • the number of users
    • a linked list of callbacks
  • Callbacks point to a HostEntry
  • Ops:
    • RPC: BreakCallBack
    • local: placing, delete, deleteVenus
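
A minimal sketch of that bookkeeping and of what breaking a callback amounts to; the types, the notify parameter, and the BreakCallBack signature below are illustrative assumptions rather than the server's code.

    struct HostEntry;                       /* per-client state (previous slide)        */

    struct ViceFid { unsigned volume, vnode, unique; };

    struct CallBackEntry {
        HostEntry     *host;                /* the Venus holding the callback           */
        CallBackEntry *next;
    };

    struct FileEntry {
        ViceFid        fid;                 /* object the callbacks are registered on   */
        int            users;               /* number of clients with callbacks         */
        CallBackEntry *callbacks;           /* linked list of callback registrations    */
        FileEntry     *next;                /* hash-chain link                          */
    };

    /* Conceptually, breaking a callback walks the list and notifies every client;
       in the server the notification is an RPC on the client's CB connection. */
    void BreakCallBack(FileEntry *fe, void (*notify)(HostEntry *, const ViceFid &))
    {
        for (CallBackEntry *cb = fe->callbacks; cb; cb = cb->next)
            notify(cb->host, fe->fid);
    }
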
Callbacks
  • The callback connection is unauthenticated; this should be fixed. The session key for the CB connection should not expire.
  • The side effect of the callback connection is used for BackFetch, the bulk transfer of files during reintegration.
RPC processing
  • Venus RPCs:
    • srvproc.cc - standard file ops
    • srvproc2.cc - standard volume ops
    • codaproc.cc - repair stuff
    • codaproc2.cc - reintegration stuff
  • Volutil RPCs:
    • vol-your-rpc.cc (in coda-src/volutil)
  • Resolution: see below
RPC processing
  • RPC structure (skeleton sketched below):
    • ValidateParms: validate, hand off COP2, cid
    • GetObjects: VM copies, lock the objects
    • CheckSemantics:
      • concurrency, integrity, permissions
    • Perform operation:
      • BulkTransfer, UpdateObjects, OutParms
    • PutObjects: RVM transactions, inode deletions
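
The same five steps as a conceptual handler skeleton. The phase names follow the slide; everything else (arguments, error handling, the function name) is an illustrative assumption, not the actual srvproc.cc code.

    long ViceOperationSkeleton(/* RPC2 handle, Fid(s), in/out parameters ... */)
    {
        long errorCode = 0;

        /* 1. ValidateParms:  sanity-check arguments, hand off piggybacked COP2, cid */
        /* 2. GetObjects:     copy the objects out of RVM into a vlist and lock them */
        /* 3. CheckSemantics: concurrency, integrity and permission checks           */
        if (errorCode == 0) {
            /* 4. Perform operation: bulk transfer, update the VM objects, OutParms  */
        }
        /* 5. PutObjects:     commit the vlist in one RVM transaction, delete inodes */
        return errorCode;
    }
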
vlists
  • GetFSObjects instantiates a vlist
    • the RPC needs a list of objects copied from RVM
    • modification status is held there (did copy-on-write kick in, etc.)
  • PutObjects (sketched below)
    • rvm_begin_transaction
    • walk through the list; copy, rvm_set_range, unlock
    • rvm_end_transaction
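
A hedged sketch of that commit walk. rvm_begin_transaction, rvm_end_transaction, and rvm_set_range are named on the slide; the vle layout, the unlock helper, and the simplified no-argument RVM calls are assumptions (the real RVM API takes transaction ids, modes, and so on).

    #include <cstring>

    struct vle {                       /* one vlist entry (simplified)            */
        void    *rvm_addr;             /* where the object lives in RVM           */
        void    *vm_copy;              /* the VM copy the RPC operated on         */
        unsigned length;               /* size of the object                      */
        bool     dirty;                /* was it modified?                        */
        vle     *next;
    };

    /* Stand-ins for the real RVM(lib) calls and object locking. */
    void rvm_begin_transaction();
    void rvm_end_transaction();
    void rvm_set_range(void *base, unsigned length);
    void UnlockObject(vle *v);

    void PutObjectsSketch(vle *vlist)
    {
        rvm_begin_transaction();
        for (vle *v = vlist; v; v = v->next) {
            if (v->dirty) {
                rvm_set_range(v->rvm_addr, v->length);       /* tell RVM this range changes */
                memcpy(v->rvm_addr, v->vm_copy, v->length);  /* copy VM state back to RVM   */
            }
            UnlockObject(v);                                 /* release the object lock     */
        }
        rvm_end_transaction();                               /* commit: RVM logs the ranges */
    }
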
COP2 handling
  • In COP2, Venus gives the final VV to the server
  • COP2 entries are sent out by Venus (with some delay), often piggybacked in bulk
  • The server tracks pending COP2 entries in a hash table (coppend.cc)
  • Manager thread CopPendingManager (sketched below)
    • runs every minute
    • removes entries more than 900 seconds old
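
A small sketch of that expiry pass over a simplified pending list; the entry layout and ExpireOldEntries are illustrative assumptions, and the real manager works on the coppend.cc hash table roughly once a minute.

    #include <ctime>

    struct CopPendingEntry {
        time_t           queued;       /* when the COP2 entry was queued          */
        CopPendingEntry *next;
        /* ... storeid, fids, version vectors ... */
    };

    const int COP2_EXPIRY = 900;       /* seconds, per the slide                  */

    /* One pass of the manager over a (simplified, single-chain) pending list:
       drop everything older than COP2_EXPIRY. */
    CopPendingEntry *ExpireOldEntries(CopPendingEntry *head)
    {
        time_t now = time(0);
        CopPendingEntry **pp = &head;
        while (*pp) {
            if (now - (*pp)->queued > COP2_EXPIRY) {
                CopPendingEntry *dead = *pp;
                *pp = dead->next;      /* unlink and discard the stale entry      */
                delete dead;
            } else {
                pp = &(*pp)->next;
            }
        }
        return head;
    }
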
COP2 to RVM
  • The data can be
    • piggybacked on another RPC
    • sent in a ViceCop2 RPC
  • Both cases (srvproc.cc) call InternalCop2
  • InternalCop2 (codaproc.cc)
    • notifies the manager to dequeue the entry
    • gets the FS objects listed in the COP2
    • installs the final VVs into RVM (an RVM transaction!)
COP2 Problems
  • An easy cause of conflicts in replicated volumes when clients access objects in rapid succession (this can be fixed easily as part of write-back caching).
  • Not optimized for singly replicated volumes.
Resolution
  • Initiated by the client with an RPC to the coordinator
    • ViceResolve (codaproc.cc)
  • The coordinator
    • sets up connections to the VSG (unauthenticated)
    • LockAndFetch (res/reslock, resutil):
      • locks the volumes
      • collects the “closure”
Resolution - special cases
  • RegResDirRequired (rvmres/rvmrescoord.cc)
  • Checks for
    • unresolved ancestors
    • objects that are already inconsistent
    • runts (missing objects)
    • weak equality (identical storeids)
RecovDirResolve
  • Phase II (rvmres/{rescoord,subphase?}.cc):
    • the coordinator requests logs from the other servers
    • the subordinates lock the affected dirs and marshall their logs
    • the coordinator merges the logs
  • Phase III:
    • ship the merged log to the subordinates
    • perform the operations on VM copies
    • return the results to the coordinator
Resolution
  • Phase IV (the old Phase 3):
    • collect the results, compute new VVs, and ship them to the subordinates
    • commit the results
Comments on resolution
  • Old versions of resolution:
    • OldDirResolve: resolves only runts and weak equality
    • DirResolve: resolves only in VM
    • remove these
  • The resolve directory has nothing to do with resolution and should be called librepair: the server uses merely one function in it; repair uses the rest
Volume Log
  • During FS operations, log entries are created for use during resolution
  • There is a different format per operation (rvmres/recov_vollog.cc)
  • Entries are added to the vlist by SpoolVMLogRecord
  • They are put into RVM at commit time
Repair
  • Venus makes a ViceRepair RPC
    • file and symlink repair: BulkTransfer the object
    • directory repair: BulkTransfer the repair file and replay the operations
    • Venus follows this with a COP2 multi-RPC
    • for directory repair, Venus invokes an asynchronous resolve
Future
  • Good:
    • the design is simple and efficient
    • there is little C++; the rest should be eliminated
    • easy to multi-thread
  • Bad:
    • scalability: ~8 GB in practice, ~40 GB in theory
    • data handling is bad: tricky to fix
    • the volume code was & is the worst: rewrite it