Distributed Namespace Status Phase I - Remote Directories

  • Wang Di

    Whamcloud, Inc.


DNE Phase I - Remote Directory

  • Subdirectories on a remote metadata target

  • Scales the namespace across MDTs, as data already scales across OSTs today

  • Dedicated performance for users/jobs

  • All MDTs can use any/all OSTs to create objects

[Figure: example namespace tree spread over MDT0, MDT1, and MDT2; labels: root, rose, bill, frank, dir1, dir2, file]

Lustre User Group 2012


Remote Directory Implementation

  • Remote directory creation by administrator only

    • Remote directory creation is a synchronous disk operation

      lfs mkdir -i {mdtidx} /path/to/remote_dir

  • Files/subdirs created in remote dir stay on MDT

    • Local operations (create, unlink, open, close) at maximum performance

    • Limit RPCs that need to communicate with multiple MDTs

    • Simplifies implementation for initial deployment

Remote Directory Limitations

  • Failed/disabled MDT affects all of its subtrees

    • Accessing failed/disabled MDT will return EIO

    • Disabling MDT0 causes whole namespace to be inaccessible

  • Remote directories can only be created under a parent directory on MDT0

    • Otherwise, failure of one MDT would isolate other MDTs

  • Rename or link across MDTs returns -EXDEV

  • Deliberate limitation of complexity

    • Limit testing, recovery, failure scenarios for initial deployment

    • Restrictions relaxed as experience is gained, or via override
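The limitations above can be illustrated with a toy model. This is a minimal sketch, not Lustre code: the class `ToyNamespace`, its methods, and the choice of `EPERM` for a refused remote mkdir are all assumptions made for illustration, and the rename check is a simplification (source parent, source object, and destination parent must share one MDT).

```python
import errno


class ToyNamespace:
    """Toy model of DNE Phase I placement rules (hypothetical, not Lustre code)."""

    def __init__(self, mdt_count):
        self.mdt_count = mdt_count
        self.failed = set()        # indices of failed/disabled MDTs
        self.dir_mdt = {"/": 0}    # directory path -> MDT index; root lives on MDT0

    @staticmethod
    def _parent(path):
        head = path.rstrip("/").rsplit("/", 1)[0]
        return head or "/"

    def _check(self, mdt):
        if mdt in self.failed:
            raise OSError(errno.EIO, "MDT%d is failed/disabled" % mdt)

    def mdt_of(self, path):
        """Walk every ancestor, so a failed MDT hides its whole subtree."""
        current = "/"
        self._check(self.dir_mdt[current])
        for part in [p for p in path.split("/") if p]:
            current = current.rstrip("/") + "/" + part
            self._check(self.dir_mdt[current])
        return self.dir_mdt[current]

    def mkdir(self, path, mdt_index=None):
        """Without mdt_index the new directory stays on its parent's MDT."""
        parent_mdt = self.mdt_of(self._parent(path))
        target = parent_mdt if mdt_index is None else mdt_index
        if not 0 <= target < self.mdt_count:
            raise ValueError("no such MDT index")
        if target != parent_mdt and parent_mdt != 0:
            # Assumed errno: remote directories only under a parent on MDT0.
            raise OSError(errno.EPERM, "remote mkdir only under MDT0")
        self._check(target)
        self.dir_mdt[path] = target

    def rename(self, src, dst):
        mdts = {
            self.mdt_of(self._parent(src)),
            self.mdt_of(src),
            self.mdt_of(self._parent(dst)),
        }
        if len(mdts) > 1:
            raise OSError(errno.EXDEV, "rename across MDTs")
        moved = [p for p in self.dir_mdt if p == src or p.startswith(src + "/")]
        for path in moved:
            self.dir_mdt[dst + path[len(src):]] = self.dir_mdt.pop(path)


ns = ToyNamespace(mdt_count=3)
ns.mkdir("/rose", mdt_index=1)   # remote directory on MDT1
ns.mkdir("/rose/dir1")           # stays on MDT1 with its parent
ns.mkdir("/bill")                # ordinary directory on MDT0
```

In this model, marking MDT0 as failed makes every path raise `EIO`, because every lookup starts at the root.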

Enable DNE on new/existing filesystem

  • MDT disk format must use ldiskfs dir_data feature

    • Default for any 2.x formatted filesystem

    • Allows storing remote directory entry pointers

    • Enable on 1.x filesystems: tune2fs -O dir_data /dev/mdt0

  • Upgrade clients, MGS, MDS, OSS to Lustre 2.4+

    • Not required to enable DNE when upgrading to Lustre 2.4+

    • Once DNE is enabled, downgrade to older Lustre difficult

      • requires copying/deleting all files not on MDT0

  • Add new MDTs to running filesystem

    • Clients without DNE support evicted at this point

    • New MDTs only used once a remote directory entry is created

      mkfs.lustre --reformat --mgsnode={mgsnode} --mdt --index=N /dev/{mdtN}

      mount -t lustre /dev/{mdtN} /mnt/{mdtN}

DNE Phase II - Shard/Stripe Directory

  • Hash a single directory across multiple MDTs

  • Multiple servers active for directory/inodes

  • Improve performance for large directories
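The idea of hashing one directory across several MDTs can be sketched as follows. The hash function here (`zlib.crc32`) and the helper names `shard_for_name` and `distribute` are assumptions chosen for illustration; the slides do not say which hash Phase II uses.

```python
import zlib


def shard_for_name(name, stripe_count):
    """Map a directory entry name to a shard index in [0, stripe_count)."""
    # crc32 is a stand-in hash, not necessarily what Lustre uses.
    return zlib.crc32(name.encode("utf-8")) % stripe_count


def distribute(names, stripe_count):
    """Group entry names by the shard that would hold them."""
    shards = {index: [] for index in range(stripe_count)}
    for name in names:
        shards[shard_for_name(name, stripe_count)].append(name)
    return shards


entries = ["cat", "car", "ace", "ale", "dale", "dog", "bob", "bee"]
layout = distribute(entries, stripe_count=4)
for index, names in sorted(layout.items()):
    print("dir.%d: %s" % (index, ", ".join(names)))
```

Because the shard is derived from the name alone, a lookup only needs to contact the one MDT that holds the matching shard.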

[Figure: one directory split into shards dir.0 through dir.3 across MDT0 to MDT3, with one master and three slaves; example entries: cat, car, ace, ale, dale, dog, bob, bee]

Lustre File IDentifier (FID)

[Figure: FID layout: Sequence # (64 bits), Object # (32 bits), Version (32 bits)]

  • Unique cluster-wide identifier for file/directory

    • Introduced in Lustre 2.0

    • Three components form object address {f_seq, f_oid, f_ver}

    • Large sequence range is allocated to each server

    • Sequences are large, so FIDs are never re-used

  • FID Location Database (FLDB) maps FID->server

    • FLDB is known to all clients and servers

    • Kept small due to few sequence ranges

    • Sequence is looked up in FLDB to find MDT/OST index

  • Object Index (OI) maps FID->inode on server

    • OI maps FID to local inode number
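The two-step resolution described above (FLDB for FID to server, then OI for FID to inode) can be sketched in a few lines. This is an illustrative model only: `ToyFldb`, the sequence ranges, the inode number, and the big-endian byte order in `pack_fid` are assumptions, not the on-disk or on-wire format.

```python
import bisect
import struct
from collections import namedtuple

# The three FID components named on the slide.
Fid = namedtuple("Fid", ["f_seq", "f_oid", "f_ver"])


def pack_fid(fid):
    """Pack as 64-bit sequence, 32-bit object id, 32-bit version (16 bytes)."""
    return struct.pack(">QII", fid.f_seq, fid.f_oid, fid.f_ver)  # byte order assumed


def unpack_fid(raw):
    return Fid(*struct.unpack(">QII", raw))


class ToyFldb:
    """Maps sequence ranges to servers; small because ranges are few."""

    def __init__(self):
        self._starts = []   # sorted range starts
        self._ranges = []   # (start, end_exclusive, server), same order

    def add_range(self, start, end, server):
        pos = bisect.bisect_left(self._starts, start)
        self._starts.insert(pos, start)
        self._ranges.insert(pos, (start, end, server))

    def lookup(self, fid):
        pos = bisect.bisect_right(self._starts, fid.f_seq) - 1
        if pos >= 0:
            start, end, server = self._ranges[pos]
            if start <= fid.f_seq < end:
                return server
        raise KeyError("sequence 0x%x is not in any known range" % fid.f_seq)


fldb = ToyFldb()
fldb.add_range(0x1000, 0x2000, "MDT0")   # hypothetical sequence ranges
fldb.add_range(0x2000, 0x3000, "MDT1")

# Per-server Object Index: FID -> local inode number (hypothetical value).
object_index = {"MDT0": {}, "MDT1": {Fid(0x2001, 5, 0): 1234}}

fid = Fid(0x2001, 5, 0)
server = fldb.lookup(fid)          # step 1: FLDB finds the server
inode = object_index[server][fid]  # step 2: OI finds the inode on that server
```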

DNE Master and Slave MDTs

  • Client does filename lookup in parent directory

    • Root directory lives on MDT0

  • Client maps FID to Master MDT via FLDB

    • If request only involves one MDT, same as current single MDT

  • Some operations need to access Slave MDTs

    • Called cross-MDT operations

    • Master MDT forwards update(s) to other MDT(s) to finish the request

    • Remote directory create and unlink are the only cross-MDT operations today

[Figure: client consults the FLDB to get the Master MDT for the operation, then exchanges request and reply with the Master MDT, which communicates with Slave MDT2]

DNE Operation

  • Create Remote Directory

Create Resend between MDTs

  • Master MDT checks RPC XID against last_rcvd file

    • Determines whether the operation was committed to disk or not

    • Committed: Master MDT reconstructs RPC reply from last_rcvd entry

    • Uncommitted: Master MDT redoes creation

      • Resend same directory creation RPC to Slave MDT using same FID

  • Slave MDT checks if remote directory was created

    • Looks up FID requested by Master in local OI

    • Creates new subdirectory with FID if missing

    • Returns success to Master
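The resend handling above can be modelled with two small classes. This is a sketch under stated assumptions: `ToyMasterMdt`, `ToySlaveMdt`, the dictionary standing in for `last_rcvd`, and the `crash_before_commit` switch are hypothetical names used to simulate a restart; none of them are Lustre interfaces.

```python
class ToySlaveMdt:
    """Slave side: creation keyed by FID, so a repeated request is harmless."""

    def __init__(self):
        self.oi = {}        # FID -> local directory object
        self.creates = 0    # number of objects actually created

    def create_remote_dir(self, fid):
        if fid not in self.oi:          # look up the FID in the local OI
            self.oi[fid] = {"entries": {}}
            self.creates += 1
        return "ok"                     # success whether new or already there


class ToyMasterMdt:
    """Master side: replay from last_rcvd if committed, otherwise redo."""

    def __init__(self, slave):
        self.slave = slave
        self.last_rcvd = {}   # XID -> committed reply (stand-in for the file)
        self.names = {}       # entry name -> FID

    def handle_create(self, xid, name, fid, crash_before_commit=False):
        if xid in self.last_rcvd:
            return self.last_rcvd[xid]      # committed: reconstruct the reply
        self.slave.create_remote_dir(fid)   # uncommitted: redo with the same FID
        if crash_before_commit:
            raise RuntimeError("master restarted before commit")
        self.names[name] = fid
        reply = {"xid": xid, "fid": fid, "status": 0}
        self.last_rcvd[xid] = reply
        return reply


slave = ToySlaveMdt()
master = ToyMasterMdt(slave)
fid = (0x2001, 7, 0)    # hypothetical FID chosen by the master

try:
    master.handle_create(xid=42, name="rose", fid=fid, crash_before_commit=True)
except RuntimeError:
    pass                # the client never saw a reply, so it resends

first = master.handle_create(xid=42, name="rose", fid=fid)    # redo path
second = master.handle_create(xid=42, name="rose", fid=fid)   # replay path
```

Reusing the same FID on resend is what lets the slave detect that the directory already exists instead of creating a second one.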

DNE Operation

  • Unlink Remote Directory

Unlink Resend between MDTs

  • Master MDT checks RPC XID against last_rcvd file

    • Determines whether the operation was committed to disk or not

    • Committed: Master MDT reconstructs RPC reply from last_rcvd entry

    • Uncommitted: Master unlinks, deletes name, adds destroy log, etc.

  • If Slave MDT fails during this process

    • llog sync thread on Master MDT will resend destroy to Slave MDT

    • Directory unlinks are idempotent, can be retried
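A rough model of the unlink path shows why the destroy can be retried safely. The classes `ToyMaster` and `ToySlave`, the list standing in for the destroy llog, and the `up` flag are assumptions for illustration only.

```python
class ToySlave:
    def __init__(self):
        self.oi = {}      # FID -> directory object
        self.up = True    # False simulates a failed Slave MDT

    def destroy(self, fid):
        if not self.up:
            raise ConnectionError("slave MDT unavailable")
        self.oi.pop(fid, None)   # idempotent: destroying twice is harmless
        return "ok"


class ToyMaster:
    def __init__(self, slave):
        self.slave = slave
        self.last_rcvd = {}      # XID -> committed reply
        self.names = {}          # entry name -> FID
        self.destroy_llog = []   # pending destroy records

    def handle_unlink(self, xid, name):
        if xid in self.last_rcvd:
            return self.last_rcvd[xid]    # committed: reconstruct the reply
        fid = self.names.pop(name)        # delete the name entry
        self.destroy_llog.append(fid)     # log the destroy for the slave
        reply = {"xid": xid, "status": 0}
        self.last_rcvd[xid] = reply
        return reply

    def llog_sync(self):
        """Send pending destroys; keep the records that could not be sent."""
        pending = []
        for fid in self.destroy_llog:
            try:
                self.slave.destroy(fid)
            except ConnectionError:
                pending.append(fid)
        self.destroy_llog = pending
        return len(pending)


slave = ToySlave()
master = ToyMaster(slave)
fid = (0x2001, 7, 0)             # hypothetical FID
slave.oi[fid] = {"entries": {}}
master.names["rose"] = fid

slave.up = False
reply = master.handle_unlink(xid=43, name="rose")
left_while_down = master.llog_sync()   # slave is down, record is kept
slave.up = True
left_after_recovery = master.llog_sync()
```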

Remote Directory Entry

  • FID is packed into the name entry

  • Each remote entry will have a local agent inode

  • Real object (inode) on Remote MDT found via OI

MDT Disk Layout

  • Two directories (AGENT and REMOTE) added

  • AGENT

    • Each remote entry has a local agent inode

    • Agent inodes located under /AGENT/MDTn, one for each remote MDT

  • REMOTE

    • Remote directories on Slave MDT created under /REMOTE

  • Keeps local disk filesystem consistent

  • Allows efficient checking of cross links by LFSCK

DNE High Availability

  • Active-Active MDT failover available with DNE

    • Allows multiple MDTs to be exported from one MDS

    • Ensures file system remains available in face of MDS node failure

    • Prevents isolation of large parts of the filesystem

[Figure: MDS1 and MDS2 serving MDT1 and MDT2, with one MDS taking over the other's MDT on failure]

Internal Architecture

Early Test Results

  • Testing done on LLNL Hyperion

    • 100 clients, 8 mount points

    • Separate directory per mount point

    • One stripe per file

Thank You

  • Wang Di

    Whamcloud, Inc.

