
HDFS Hadoop Distributed File System



Presentation Transcript


  1. HDFS: Hadoop Distributed File System 100062123 柯懷貿 100062139 王建鑫 101062401 彭偉慶

  2. Outline • Introduction • HDFS – How it works • Pros and Cons • Conclusion 柯懷貿

  3. Introduction to HDFS Hadoop Distributed File System • Cloud computing • Java • Processing PB-level data • Distributed computing environment • Allows files to be shared via the Internet • Write-once-read-many • Restricted access • Replication & fault tolerance • Mapping between logical objects & physical objects • Created by Doug Cutting • Nutch project • File system for the Hadoop framework • Remote Procedure Call • Master/Slave • Yahoo! deployed a 10,000-core Hadoop cluster in 2008 • HDFS • Hadoop MapReduce • HBase 柯懷貿

  4. MapReduce 柯懷貿

  5. HBase • NoSQL • Using several servers to store PB-level data 柯懷貿

  6. HDFS • Distributed, scalable, and portable • File replication (default: 3 replicas) • Read efficiency 柯懷貿

  7. 王建鑫

  8. HDFS major roles • Client (user) – reads/writes data from/to the file system • Name node (master) – oversees and coordinates the data storage function, receives instructions from the client • Data nodes (slaves) – store data and run computations, receive instructions from the name node 王建鑫
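
The division of labour between the three roles can be sketched as a toy in-memory model. This is only an illustration, not HDFS code: every class and method name below is hypothetical, placement is simple round-robin, the block size is a few bytes, and the client writes each replica directly instead of going through a pipeline.

```python
# Toy model of the three roles; all names here are hypothetical, not the HDFS API.

class DataNode:
    """Slave: stores block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: keeps metadata only (which data nodes hold which block)."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.file_blocks = {}  # file name -> [(block id, [DataNode, ...]), ...]

    def allocate(self, filename, num_blocks, replication=3):
        plan = []
        n = len(self.data_nodes)
        for i in range(num_blocks):
            # Round-robin placement: an assumption made for brevity.
            targets = [self.data_nodes[(i + r) % n] for r in range(replication)]
            plan.append((f"{filename}#blk{i}", targets))
        self.file_blocks[filename] = plan
        return plan

    def locate(self, filename):
        return self.file_blocks[filename]


class Client:
    """User: asks the name node where blocks live, then talks to data nodes."""
    def __init__(self, name_node):
        self.name_node = name_node

    def write(self, filename, data, block_size=4):
        chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
        plan = self.name_node.allocate(filename, len(chunks))
        for chunk, (block_id, targets) in zip(chunks, plan):
            for dn in targets:
                dn.write_block(block_id, chunk)

    def read(self, filename):
        parts = []
        for block_id, targets in self.name_node.locate(filename):
            parts.append(targets[0].read_block(block_id))  # any replica would do
        return b"".join(parts)
```

Note that file data never passes through the name node in this sketch; it only hands out block locations, which matches the "oversee and coordinate" role on the slide.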

  9. 王建鑫

  10. 王建鑫

  11. Rack Awareness 王建鑫

  12. 王建鑫

  13. 王建鑫

  14. 王建鑫

  15. 王建鑫

  16. 王建鑫

  17. 王建鑫

  18. HDFS fault tolerance • Node failure – a data node or the name node is dead • Communication failure – data cannot be sent or retrieved • Data corruption – data is corrupted while being sent over the network, or corrupted on the hard disks • Write failure – the data node that is about to be written to is dead • Read failure – the data node that is about to be read from is dead 王建鑫

  19. 王建鑫

  20. Detecting network failure • Whenever data is sent, the receiver replies with an ACK • If the ACK is not received (after several retries), the sender assumes that the host is dead or the network has failed • A checksum is also sent along with the transmitted data → corruption during transfer can be detected 王建鑫
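
A minimal sketch of the ACK-with-retries and checksum idea, assuming a CRC32 checksum from Python's `zlib` module and a hypothetical `channel` callable that stands in for the network (it returns a truthy ACK, or `None` when no reply arrives). The function names and the retry count of 3 are illustrative assumptions, not HDFS settings.

```python
import zlib

def make_packet(payload: bytes):
    """Sender side: attach a CRC32 checksum to the payload."""
    return payload, zlib.crc32(payload)

def receive(payload: bytes, checksum: int) -> bool:
    """Receiver side: ACK (True) only if the data matches its checksum."""
    return zlib.crc32(payload) == checksum

def send_with_retries(payload: bytes, channel, max_retries: int = 3) -> int:
    """Send until an ACK arrives; return the attempt number that succeeded.

    `channel` is a hypothetical callable(payload, checksum) that returns a
    truthy ACK, or None/False when no valid ACK comes back.
    """
    data, checksum = make_packet(payload)
    for attempt in range(1, max_retries + 1):
        if channel(data, checksum):
            return attempt
    # No ACK after several retries: assume the host is dead or the network failed.
    raise ConnectionError(f"no ACK after {max_retries} attempts")
```

A corrupted payload fails the `receive` check, so the sender sees it the same way as a lost packet and retries.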

  21. Handling write/read failure • The client writes a block in smaller data units (usually 64 KB) called packets • Each data node replies with an ACK for each packet to confirm that it received the packet • If the client doesn't get ACKs from some nodes, a dead node has been detected • The client then adjusts the pipeline to skip that node (then?) • Handling read failure: just read from another node 王建鑫
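
The packet-by-packet write with pipeline adjustment can be sketched as follows. This is a simplified model under stated assumptions: the `DataNode` class, `write_block`, and `read_block` are hypothetical names, a missing ACK is modelled as `receive` returning `False`, and the client talks to every node directly rather than forwarding packets node to node.

```python
PACKET_SIZE = 64 * 1024  # 64 KB, the usual packet size named on the slide

class DataNode:
    """Hypothetical data node that stores the packets it receives."""
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.packets = []

    def receive(self, packet):
        """Return True as the ACK, or False (no ACK) if the node is dead."""
        if not self.alive:
            return False
        self.packets.append(packet)
        return True

def write_block(block, pipeline, packet_size=PACKET_SIZE):
    """Write a block packet by packet; return the nodes that ACKed everything."""
    pipeline = list(pipeline)
    for offset in range(0, len(block), packet_size):
        packet = block[offset:offset + packet_size]
        # Nodes that do not ACK are treated as dead and skipped from now on.
        pipeline = [dn for dn in pipeline if dn.receive(packet)]
        if not pipeline:
            raise IOError("no data node left in the pipeline")
    return pipeline

def read_block(replicas):
    """Read failure handling: just try the next node holding a replica."""
    for dn in replicas:
        if dn.alive:
            return b"".join(dn.packets)
    raise IOError("no live replica")
```

After a node is skipped, the block has fewer replicas than intended; restoring the replica count is the name node's job, not the client's.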

  22. Handling write failure (cont'd) • The name node maintains two tables: • List of blocks – blockA on dn1, dn2, dn8; blockB on dn3, dn7, dn9… • List of data nodes – dn1 has blockA, blockD; dn2 has blockE, blockG… • The name node checks the list of blocks to see whether any block is not properly replicated • If so, it asks other data nodes to copy the block from data nodes that hold a replica 王建鑫
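
The two tables and the re-replication check can be sketched with plain dictionaries. The function names, the set of live nodes, and the rule for choosing a target (first live node in name order that lacks the block) are assumptions made for illustration; the actual name node's selection policy is not described on the slide.

```python
def invert(block_to_nodes):
    """Derive the 'list of data nodes' table from the 'list of blocks' table."""
    node_to_blocks = {}
    for block, nodes in block_to_nodes.items():
        for node in nodes:
            node_to_blocks.setdefault(node, []).append(block)
    return node_to_blocks

def find_under_replicated(block_to_nodes, live_nodes, replication=3):
    """Return {block: live replica holders} for blocks below the target count."""
    result = {}
    for block, nodes in block_to_nodes.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < replication:
            result[block] = live
    return result

def plan_re_replication(block_to_nodes, live_nodes, replication=3):
    """Return (block, source, target) copy instructions for the data nodes."""
    plan = []
    under = find_under_replicated(block_to_nodes, live_nodes, replication)
    for block, live in under.items():
        if not live:
            continue  # every replica is lost; there is nothing to copy from
        # Assumed policy: pick targets in name order among nodes lacking the block.
        candidates = sorted(n for n in live_nodes if n not in live)
        for target in candidates[:replication - len(live)]:
            plan.append((block, live[0], target))
    return plan
```

For example, with blockA on dn1, dn2, dn8 and dn8 dead, the plan asks a live node that lacks blockA to copy it from dn1.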

  23. Pros • Very large files • File sizes over xxx MB, GB, TB, PB… • Streaming data access • Write-once, read-many • Efficient at reading a whole dataset • Commodity hardware • High reliability and availability • Doesn't require expensive, highly reliable hardware 彭偉慶

  24. Cons 彭偉慶

  25. Conclusion • HDFS is an Apache Hadoop subproject • It is highly fault-tolerant and designed to be deployed on low-cost hardware • It offers high throughput, but not low latency 彭偉慶
