
Lecture 21: Indexed Files



Presentation Transcript


  1. CSC 213 – Large Scale Programming • Lecture 21: Indexed Files

  2. Today’s Goals • Look at how Dictionaries are used in the real world • Where this occurs & why they are used there • In a real-world setting, what problems can & do occur • Indexed file usage presented and shown • How & why we split index & data files • Formatting of each file and how they get used • Describe what problems are solved using indexed files • Java coding techniques that simplify using these files • Idea needed when using multiple indexes shown

  3. Dictionaries in Real World • Often need large database on many machines • Split search terms across machines • Updating & searching work split between machines • Database way too large for any single machine • If you think about it, this is incredibly common • Where?

  4. Split Dictionaries


  6. Splitting Keys From Values • In real world, we often have many indices • Simple units measure where we can find values • Values could be searched for in multiple ways


  8. Index & Data Files • Split information into two (or more) files • Data file uses fixed-size records to store data • Index files contain search terms & data locations • Fixed-size records usually used in data file • Each record will use exactly that much space • Extra space wasted if the value is smaller • But limits data size, cannot get more space • Makes it far easier to reuse space & rebuild index
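The fixed-size record idea can be sketched as follows. This is a minimal illustration, not the lecture's code: the 20-byte name field, the 4-byte price, and every class and method name here are assumptions chosen for the example. Because every record has the same size, record n starts at n * RECORD_SZ, short names are padded (the wasted space), and long names are rejected (the size limit).

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class FixedRecordDemo {
    // Hypothetical layout: a 20-byte name field followed by a 4-byte int price.
    static final int NAME_SZ = 20;
    static final int PRICE_SZ = 4;
    static final int RECORD_SZ = NAME_SZ + PRICE_SZ;

    // Every record has the same size, so record n always starts at n * RECORD_SZ.
    static long positionOf(int recordNumber) {
        return (long) recordNumber * RECORD_SZ;
    }

    static void writeRecord(RandomAccessFile file, int recordNumber, String name, int price)
            throws IOException {
        byte[] raw = name.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > NAME_SZ) {
            // The fixed size limits the data: there is no way to get more space.
            throw new IllegalArgumentException("name does not fit in " + NAME_SZ + " bytes");
        }
        // Short names are padded with zero bytes; this is the wasted space.
        byte[] padded = Arrays.copyOf(raw, NAME_SZ);
        file.seek(positionOf(recordNumber));
        file.write(padded);
        file.writeInt(price);
    }

    static String readName(RandomAccessFile file, int recordNumber) throws IOException {
        byte[] padded = new byte[NAME_SZ];
        file.seek(positionOf(recordNumber));
        file.readFully(padded);
        // trim() drops the trailing zero-byte padding.
        return new String(padded, StandardCharsets.US_ASCII).trim();
    }

    static int readPrice(RandomAccessFile file, int recordNumber) throws IOException {
        file.seek(positionOf(recordNumber) + NAME_SZ);
        return file.readInt();
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("stocks", ".dat");
        tmp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
            writeRecord(file, 0, "IBM", 106);
            writeRecord(file, 1, "AT & T", 23);
            System.out.println(readName(file, 1) + " costs " + readPrice(file, 1));
        }
    }
}
```

Overwriting record 0 later reuses exactly the same bytes, which is why fixed sizes make reusing space easy.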

  9. Index File Format • No standard format – depends on type of data • Often variable sized, but this is not a requirement • Each entry in index file begins with exact search term • Followed by position of the matching data • As a result, often find indexes smushed together • Can read indexes at start of program execution • Reasonably assumes index file is smaller than data file • Changes written immediately, however • When program starts, do NOT read data file
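Since the slide says there is no standard index format, the sketch below simply assumes one: each entry is the search term written with writeUTF, followed by the data-file position written as a long. The class and method names are hypothetical. It shows entries packed back to back and the whole index being read into memory at startup, without touching the data file.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

class IndexFileDemo {
    // Assumed entry format: search term (writeUTF), then position in data file (long).
    static void saveIndex(File indexFile, Map<String, Long> index) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexFile)))) {
            for (Map.Entry<String, Long> entry : index.entrySet()) {
                out.writeUTF(entry.getKey());    // variable-sized search term
                out.writeLong(entry.getValue()); // where the matching record lives
            }
        }
    }

    // Reads every entry at program start; the data file is never opened here.
    static Map<String, Long> loadIndex(File indexFile) throws IOException {
        Map<String, Long> index = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(indexFile)))) {
            while (true) {
                String term;
                try {
                    term = in.readUTF();
                } catch (EOFException endOfIndex) {
                    break; // clean end of file: no more entries
                }
                index.put(term, in.readLong());
            }
        }
        return index;
    }
}
```

A call such as `Map<String, Long> index = IndexFileDemo.loadIndex(new File("stocks.idx"));` (the file name is made up) would then give an in-memory structure to search before any seek into the data file.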

  10. Never Read Entire Data File

  11. Indexed Files • Enables splitting search terms across computers • Alphabetical split searches faster on many servers • [Figure: servers labeled with the key ranges A-C, D-E, F-H, I-P, Q-R, S-T, U-X, Y-Z]
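One way to picture the alphabetical split is a small router that picks a server from the first letter of the search term. The letter ranges below are the ones shown on the slide; the server names, the class name, and the use of a TreeMap are assumptions made for this sketch.

```java
import java.util.TreeMap;

class KeyRangeRouter {
    // Maps the first letter of each range to the (hypothetical) server holding it.
    private final TreeMap<Character, String> rangeStartToServer = new TreeMap<>();

    KeyRangeRouter() {
        rangeStartToServer.put('A', "server-A-C");
        rangeStartToServer.put('D', "server-D-E");
        rangeStartToServer.put('F', "server-F-H");
        rangeStartToServer.put('I', "server-I-P");
        rangeStartToServer.put('Q', "server-Q-R");
        rangeStartToServer.put('S', "server-S-T");
        rangeStartToServer.put('U', "server-U-X");
        rangeStartToServer.put('Y', "server-Y-Z");
    }

    String serverFor(String term) {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("empty search term");
        }
        char first = Character.toUpperCase(term.charAt(0));
        if (first < 'A' || first > 'Z') {
            throw new IllegalArgumentException("term must start with a letter A-Z: " + term);
        }
        // floorEntry finds the range whose starting letter is closest at or below 'first'.
        return rangeStartToServer.floorEntry(first).getValue();
    }
}
```

Each server then only needs the index entries for its own range, so both searching and updating are divided among the machines.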

  12. Indexed Files • Enables splitting search terms across computers • Create indexes for different types of searching • [Figure: example indexes by song name and by song length]

  13. How Does This Work? • Using index files is simplified by using positions • Look in index structure to find position of data in file • With this position, can then seek to specific record • Create instance & initialize by reading data from file
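The three steps above (look up the position, seek, create an instance from the file) might look like the sketch below. The Quote class, its two int fields, and the in-memory Map used as the index structure are assumptions for illustration only.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

class IndexLookupDemo {
    // Hypothetical object created from one record in the data file.
    static final class Quote {
        final int price;
        final int shares;

        Quote(int price, int shares) {
            this.price = price;
            this.shares = shares;
        }
    }

    static Quote find(Map<String, Long> index, RandomAccessFile data, String key)
            throws IOException {
        Long position = index.get(key);   // 1. look in the index structure
        if (position == null) {
            return null;                  // key is not in this index
        }
        data.seek(position);              // 2. seek to that specific record
        int price = data.readInt();       // 3. initialize instance from the file
        int shares = data.readInt();
        return new Quote(price, shares);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("quotes", ".dat");
        tmp.deleteOnExit();
        Map<String, Long> index = new HashMap<>();
        try (RandomAccessFile data = new RandomAccessFile(tmp, "rw")) {
            index.put("IBM", data.getFilePointer());
            data.writeInt(106);
            data.writeInt(500);
            index.put("T", data.getFilePointer());
            data.writeInt(23);
            data.writeInt(80);
            Quote quote = find(index, data, "T");
            System.out.println(quote.price + " " + quote.shares);
        }
    }
}
```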

  14. Starting with Indexed Files • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), (Ford, 2, F)]

  15. Where Was "Searching" Used? • Indexed files used in Maps and Dictionaries • Read data into searchable object after opening file • For each record, Entry uses indexed data as its key • Single data file has multiple indexes to search it • Not a problem, each index has own Collection • Cannot have multiple instances for each data item • Cannot have single instance for each data item • Then how can we construct each Entry's value?

  16. Proxy Pattern For The Win!

  17. Proxy Pattern For The Win! • Create proxy instances to use as Entry's value • Proxy pretends it has the data by defining getters & setters • Data's position & file are the only fields these objects have • Whenever a method is called, it looks up & returns the data • Other classes will think proxy has the fields declared • Simplifies using class & ensures up-to-date data used • But little memory needed, since data resides on disk!
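A stripped-down sketch of the proxy idea, assuming a record that holds nothing but one int price. PriceProxy and its methods are hypothetical names. The point it illustrates is that the proxy stores only a position and a file, so every getter call goes to disk and always sees the current value.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class PriceProxyDemo {
    // Hypothetical proxy: its only fields are the record's position and the file.
    static final class PriceProxy {
        private final long position;
        private final RandomAccessFile file;

        PriceProxy(long position, RandomAccessFile file) {
            this.position = position;
            this.file = file;
        }

        // Looks like a normal getter, but the value is read from disk on every call.
        int getPrice() throws IOException {
            file.seek(position);
            return file.readInt();
        }

        // Looks like a normal setter, but the value is written straight to disk.
        void setPrice(int price) throws IOException {
            file.seek(position);
            file.writeInt(price);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("prices", ".dat");
        tmp.deleteOnExit();
        try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
            file.writeInt(106); // record at position 0
            file.writeInt(23);  // record at position 4
            PriceProxy first = new PriceProxy(0, file);
            PriceProxy alsoFirst = new PriceProxy(0, file);
            first.setPrice(110);
            // Neither proxy caches the price, so the change is visible through both.
            System.out.println(alsoFirst.getPrice());
        }
    }
}
```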

  18. Starting with Indexed Files • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), (Ford, 12, F)]

  19. Coding

      import java.io.RandomAccessFile;

      public class Stock {
          private static final int NAME_OFF = 0;
          private static final int NAME_SZ = 50;
          private static final int PRC_OFF = NAME_OFF + NAME_SZ;
          private static final int PRC_SZ = 4;
          private static final int TICK_OFF = PRC_OFF + PRC_SZ;
          private static final int TICK_SZ = 6;
          private static final int SIZE = TICK_OFF + TICK_SZ;

          private long position;
          private RandomAccessFile theFile;

          public Stock(long pos, RandomAccessFile file) {
              position = pos;
              theFile = file;
          }
          // Getters & setters follow on a later slide

  20. Coding • Same Stock class as on the previous slide, annotated • The _SZ constants give the fixed max. size of each field • SIZE is the fixed size of a record in the data file

  21. Coding • Same Stock class, annotated • Each _OFF constant is the offset in the record to that field's start

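  22. Coding

      import java.io.IOException;

      public class Stock { // Continues from last time
          // Each accessor seeks to its field inside this record, then reads or writes it.
          public int getStockPrice() throws IOException {
              theFile.seek(position + PRC_OFF);
              return theFile.readInt();
          }

          public void setStockPrice(int price) throws IOException {
              theFile.seek(position + PRC_OFF);
              theFile.writeInt(price);
          }

          // writeUTF stores a 2-byte length before the characters, so sym
          // must encode to at most TICK_SZ - 2 bytes to stay inside its field.
          public void setTickerSymbol(String sym) throws IOException {
              theFile.seek(position + TICK_OFF);
              theFile.writeUTF(sym);
          }
          // More getters & setters from here…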

  23. Visualizing Indexed Files • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), (Ford, 12, F)]

  24. How Do We Add Data? • Adding new records takes only a few steps • Add space for record with setLength on data file • Update index structure(s) to include new record • Records in data file updated at each change
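The steps for adding a record could be sketched like this, assuming an 8-byte record of two ints and an in-memory Map as the index structure (both are assumptions, as are all names). RandomAccessFile.setLength grows the data file, the record is written right away, and only the in-memory index is updated.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

class AddRecordDemo {
    // Hypothetical fixed record size: an int price followed by an int share count.
    static final int RECORD_SZ = 8;

    static long addRecord(RandomAccessFile data, Map<String, Long> index,
                          String key, int price, int shares) throws IOException {
        long position = data.length();        // new record goes at the current end
        data.setLength(position + RECORD_SZ); // add space for the record
        data.seek(position);
        data.writeInt(price);                 // data file is updated immediately
        data.writeInt(shares);
        index.put(key, position);             // index structure now includes the record
        return position;
    }
}
```

If the data file has several indexes, each of them would need its own `put` call with the same position.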

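  25. Adding New Data To The Files • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), (Ford, 12, F), plus a newly added empty record shown as (0, Ø)]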

  26. Adding New Data To The Files • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), (Ford, 12, F), (Citibank, -2, C)]

  27. How Does This Work? • Removing records even easier • To prevent using record, remove items from indexes • Do NOT update index file(s) until program completes • Use impossible magic numbers for record in data file
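Removal might be sketched as below. The choice of Integer.MIN_VALUE as the "impossible" magic number, the assumption that each record begins with an int price, and all names are made up for this example. The key is dropped from the in-memory index so the record can no longer be found, and the marker lets a later index rebuild skip the slot or reuse it.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

class RemoveRecordDemo {
    // Hypothetical magic number: a price that no real record would contain.
    static final int REMOVED = Integer.MIN_VALUE;

    // Assumes every record begins with an int price field.
    static boolean removeRecord(RandomAccessFile data, Map<String, Long> index, String key)
            throws IOException {
        Long position = index.remove(key); // record can no longer be found by searching
        if (position == null) {
            return false;                  // nothing indexed under this key
        }
        data.seek(position);
        data.writeInt(REMOVED);            // mark the slot as unused in the data file
        return true;
    }

    static boolean isRemoved(RandomAccessFile data, long position) throws IOException {
        data.seek(position);
        return data.readInt() == REMOVED;
    }
}
```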

  28. Removing Data As We Go • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), (Ford, 12, F), (Citibank, -2, C)]

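  29. Removing Data As We Go • [Figure: data file records shown as (name, price, ticker): (IBM, 106, IBM), (AT & T, 23, T), an emptied record shown as (0, Ø) where Ford was, (Citibank, -2, C)]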

  30. Using Multiple Indexes • Multiple indexes for a data file are very often needed • Provides many ways of searching for important data • Since each index file is read individually, this could also create a problem • Multiple proxy instances for the same data could be created • Duplicates of instance are created for each index • Makes removing them all difficult, since not linked • Very easy to solve: use Map while loading index • Map converts positions in file to proxy instances to solve this

  31. Linking Multiple Indexes • Use one Map instance while reading all indexes • For each position in file, check if already in Map • Use existing proxy instance, if position already in Map • If a search in Map returns null, create new instance • Make sure to call put() when we must create proxy
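The linking steps can be sketched with one shared Map from position to proxy. RecordProxy here is a hypothetical stand-in that only remembers its position, and the `link` helper and its signature are assumptions for this example rather than code from the lecture.

```java
import java.util.HashMap;
import java.util.Map;

class SharedProxyDemo {
    // Hypothetical stand-in for a proxy: it only remembers its position in the data file.
    static final class RecordProxy {
        final long position;

        RecordProxy(long position) {
            this.position = position;
        }
    }

    // Turns one index (search term -> position) into (search term -> proxy),
    // reusing any proxy already created for the same position by an earlier index.
    static <K> Map<K, RecordProxy> link(Map<K, Long> index, Map<Long, RecordProxy> byPosition) {
        Map<K, RecordProxy> linked = new HashMap<>();
        for (Map.Entry<K, Long> entry : index.entrySet()) {
            RecordProxy proxy = byPosition.get(entry.getValue());
            if (proxy == null) {
                // First time this position is seen: create the proxy and remember it.
                proxy = new RecordProxy(entry.getValue());
                byPosition.put(entry.getValue(), proxy);
            }
            linked.put(entry.getKey(), proxy);
        }
        return linked;
    }
}
```

Passing the same `byPosition` map to `link` for every index means a record found by name and the same record found by ticker are one object, so removing it only has to deal with a single instance.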

  32. What to Study for Midterm • Study your Maps and Dictionaries • When would we use each of the ADTs? Why? • What do their methods do? Why do they differ? • Consider each implementation of these ADTs • Explain why method has its given big-Oh complexity • Why use an implementation? Where is it used? • What are negatives or limitations of implementation? • What fields are needed by implementation? Why is this?


  34. What to Study for Midterm • List-based approaches – Why? When? • Hash tables • How do hash functions work? What does mod do? • How do we add & remove data from hash table? • What are collisions & how do we handle them? • What is real & pretend big-Oh complexity? Why? • Binary Search Trees • How do we add, remove, & search in these trees? • How are data in BSTs organized? Tricks to their use? • How do we code & use BSTs? What methods exist?

  35. What to Study for Midterm • AVL Trees • How do we add, remove, & search in these trees? • How are data in them organized? Tricks to their use? • When must we reorganize tree? How is this done? • Splay Trees • How do we add, remove, & search in these trees? • For each method, is a node splayed & which one? • How to chain splayings together? When do we stop?

  36. What to Study for Midterm • Class selection & design • Where do classes come from? How do we know? • When to use each connection between classes? • How to list methods & fields in UML class diagram? • Comments & Outlines • When, where, and how much? • What should & should not be included?

  37. Midterm Process • Open-book & open-note test; do not memorize • But have methods & information at your fingertips • Use my slides ONLY with note(s) on that day's slides • Cannot use daily or weekly activities • Must submit all printed pages along with test • Problems resemble tone of those already seen • All new problems, however; do not memorize answers • Includes tracing, showing state of ADT, method returns • Coding, big-Oh analysis, and more can be asked

  38. For Next Lecture • Midterm #1 in class a week from Friday • Project #2 available on Angel on Friday, too • Lab phase #2 due on Friday at midnight • I will still be out of town, but lab activity will be posted • Due a week from Friday; chance to use indexed files • No class on Monday; take some time to relax • I will be out of town serving on an NSF grant panel • Updated schedule on Angel accounts for change
