
Distributed Protein Structure Analysis




Presentation Transcript


  1. Distributed Protein Structure Analysis By Jeremy S. Brown and Travis E. Brown

  2. The Problem • An exhaustive search of proteins against a known database • Each string is between 400 and 600 characters long • Comparing 10,000 strings against 1,000,000 random strings would take 44 days on a 1 GHz processor
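The figures on this slide imply a rough cost per comparison. The sketch below is only arithmetic on the slide's own numbers; it assumes the 44 days are spent entirely on comparisons and that the processor retires one cycle per clock tick, neither of which the slides confirm.

```python
# Back-of-the-envelope cost per comparison, using only the slide's figures.
SECONDS_PER_DAY = 24 * 60 * 60

queries = 10_000            # search strings
database = 1_000_000        # known strings
days = 44                   # quoted single-machine run time
clock_hz = 1_000_000_000    # 1 GHz

comparisons = queries * database             # every query against every entry
seconds = days * SECONDS_PER_DAY
comparisons_per_second = comparisons / seconds
cycles_per_comparison = clock_hz / comparisons_per_second

print(f"total comparisons:      {comparisons:,}")
print(f"comparisons per second: {comparisons_per_second:,.0f}")
print(f"cycles per comparison:  {cycles_per_comparison:,.0f}")
```

Under these assumptions each comparison costs a few hundred thousand cycles, which is the same order of magnitude as the product of two string lengths of about 500 characters.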

  3. Why an exhaustive search? • Initial intent was to analyze proteins to determine three-dimensional structure • Exhaustive search is required to ensure that the match that is found is the best match

  4. Solution • Distribute the search among many PCs to obtain an answer faster. • Solution raises more problems, however

  5. How to Distribute • Distributing search strings is not enough • Also distribute the search space • Must find an efficient way to distribute 1 GB of data without duplication
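One way to split the search space without duplication is to cut the database into disjoint index ranges and hand each range to exactly one client. This is a minimal sketch of that idea, not the presenters' scheme; `chunk_ranges`, `assign_chunks`, the chunk size, and the round-robin policy are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Tuple

Range = Tuple[int, int]  # half-open [start, end) range of database record indices


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Range]:
    """Yield disjoint half-open ranges that together cover 0..total exactly once."""
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


def assign_chunks(total: int, chunk_size: int, clients: List[str]) -> Dict[str, List[Range]]:
    """Deal the chunks out round-robin so no record is sent to two clients."""
    plan: Dict[str, List[Range]] = {client: [] for client in clients}
    for i, rng in enumerate(chunk_ranges(total, chunk_size)):
        plan[clients[i % len(clients)]].append(rng)
    return plan


if __name__ == "__main__":
    # Hypothetical sizes: 1,000,000 records in chunks of 100,000 across 3 clients.
    for client, ranges in assign_chunks(1_000_000, 100_000, ["pc1", "pc2", "pc3"]).items():
        print(client, ranges)
```

Because the ranges are disjoint, every client must eventually see every search string, which is why distributing the search strings alone is not enough.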

  6. Program Details • Client/Server architecture • Uses proprietary protocol over TCP/IP to distribute data • Server uses SQL database to store a list of ‘jobs’ and ‘known’ sequences

  7. Program Details • Server issues a single ‘job’ to each client upon request • Client may also request a batch of data for comparison. • Server marks which data has been sent to clients and avoids resending that data to new clients
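The bookkeeping described on this slide can be sketched in memory as follows. The real server keeps this state in an SQL database; the `Dispatcher` class, its method names, and the use of plain strings as job and batch identifiers are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, Optional


class Dispatcher:
    """Hands out one job per request and never sends the same data batch twice."""

    def __init__(self, jobs: Iterable[str], batches: Iterable[str]) -> None:
        self._jobs: Deque[str] = deque(jobs)
        self._unsent: Deque[str] = deque(batches)
        self.sent_to: Dict[str, str] = {}  # batch id -> client that received it

    def request_job(self, client_id: str) -> Optional[str]:
        """Issue a single job, or None when the queue is empty."""
        return self._jobs.popleft() if self._jobs else None

    def request_batch(self, client_id: str) -> Optional[str]:
        """Issue a batch that no client has received yet, and mark it as sent."""
        if not self._unsent:
            return None
        batch = self._unsent.popleft()
        self.sent_to[batch] = client_id
        return batch


if __name__ == "__main__":
    server = Dispatcher(jobs=["job-1", "job-2"], batches=["batch-A", "batch-B"])
    print(server.request_job("pc1"), server.request_batch("pc1"))
    print(server.request_job("pc2"), server.request_batch("pc2"))
    print(server.request_batch("pc3"))  # nothing left to send
```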

  8. The Server

  9. The Client

  10. Problems • Server is slow at updating its database • However, this cost is incurred only once per client

  11. Performance Analysis • 1 client = 44 days (best algorithm) • 2 clients = 26 days • 3 clients = 19 days • Adding clients increases speed almost linearly, though distribution is expensive
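The run times on this slide can be converted into speedup and parallel efficiency. The sketch below uses only the three measurements quoted above and the standard definitions: speedup is single-client time divided by n-client time, and efficiency is speedup divided by n.

```python
# Run time in days for each client count, as quoted on the slide.
days_by_clients = {1: 44, 2: 26, 3: 19}

baseline = days_by_clients[1]
speedup = {n: baseline / days for n, days in days_by_clients.items()}
efficiency = {n: speedup[n] / n for n in days_by_clients}

for n in sorted(days_by_clients):
    print(f"{n} client(s): speedup {speedup[n]:.2f}x, efficiency {efficiency[n]:.0%}")
```

With these figures, efficiency falls from 100% with one client to roughly 85% with two and 77% with three, which is consistent with the remark that distribution is expensive.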

  12. Graphs

  13. Graphs

  14. Graphs

  15. Graphs

  16. Notes about the graphs • Graphs do not include initial distribution, since this is only done once per client • If search data distribution were to be included, efficiency would start at about 70% and increase to ~90% over time

  17. Verifying Data Accuracy • Add an entry into the search table with a known score • Ensure that the result returned by the client is the known entry in the database
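The check on this slide can be sketched as a planted-entry test: insert a sequence whose score against the query is known in advance, run the search, and confirm that entry comes back. The scoring function here (a count of matching positions) and all names and sequences are placeholders, not the comparison algorithm used in the project.

```python
from typing import Callable, Dict


def match_score(a: str, b: str) -> int:
    """Placeholder score: number of positions where the two strings agree."""
    return sum(x == y for x, y in zip(a, b))


def best_match(query: str, database: Dict[str, str]) -> str:
    """Exhaustively score every entry and return the id of the highest scorer."""
    return max(database, key=lambda entry_id: match_score(query, database[entry_id]))


def verify(search: Callable[[str, Dict[str, str]], str],
           query: str, database: Dict[str, str], planted_id: str) -> bool:
    """True if the search returns the entry that was planted with a known score."""
    return search(query, database) == planted_id


if __name__ == "__main__":
    query = "MKTAYIAKQR"                      # made-up sequence
    database = {
        "rand1": "ACDEFGHIKL",
        "rand2": "LKIHGFEDCA",
        "planted": query,                     # identical to the query: score is len(query)
    }
    print("search verified:", verify(best_match, query, database, "planted"))
```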

  18. Lessons Learned • Reading and writing single elements to an SQL database can be very expensive. • Even the best designs aren’t perfect, especially when the problem is not fully understood.
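The first lesson can be illustrated by contrasting row-by-row commits with a single batched transaction. The slides do not name the database product, so this sketch uses Python's built-in `sqlite3` module as a stand-in; the `sent` table and its columns are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[int, str]  # (batch_id, client_id)


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'sent' tracking table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sent (batch_id INTEGER PRIMARY KEY, client_id TEXT)")
    return conn


def mark_one_at_a_time(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """One statement and one commit per element: the expensive pattern."""
    for row in rows:
        conn.execute("INSERT INTO sent (batch_id, client_id) VALUES (?, ?)", row)
        conn.commit()


def mark_in_bulk(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """All rows in a single transaction; 'with conn' commits once on success."""
    with conn:
        conn.executemany("INSERT INTO sent (batch_id, client_id) VALUES (?, ?)", rows)


def count_sent(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM sent").fetchone()[0]


if __name__ == "__main__":
    rows = [(i, "pc1") for i in range(1000)]
    slow, fast = make_db(), make_db()
    mark_one_at_a_time(slow, rows)
    mark_in_bulk(fast, rows)
    print(count_sent(slow), count_sent(fast))  # same data, far fewer commits in bulk
```

Both functions leave the table in the same state; the difference is the number of round trips and commits, which is where per-element access tends to become costly on a disk-backed or networked database.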

  19. Notes • Biggest problem was distribution of data • Distribution was very costly, so we tried to reuse data that we had already distributed • Program is pluggable, so any comparison algorithm can be used
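A pluggable design of the kind mentioned here can be sketched by passing the comparison function into the search loop. The `Comparator` type, `search`, and `identity_score` are hypothetical names; the second comparator uses the standard library's `difflib.SequenceMatcher` purely as an example of swapping in a different algorithm.

```python
import difflib
from typing import Callable, Iterable, Tuple

# Any function that takes two sequences and returns a score (higher is better).
Comparator = Callable[[str, str], float]


def identity_score(a: str, b: str) -> float:
    """Fraction of aligned positions that match, relative to the longer string."""
    longest = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / longest if longest else 0.0


def difflib_score(a: str, b: str) -> float:
    """Similarity ratio from the standard library's SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def search(query: str, database: Iterable[str], compare: Comparator) -> Tuple[str, float]:
    """Exhaustive search: score every entry with the supplied comparator."""
    best = max(database, key=lambda entry: compare(query, entry))
    return best, compare(query, best)


if __name__ == "__main__":
    database = ["ABCF", "ZZZZ"]  # made-up sequences
    for comparator in (identity_score, difflib_score):
        print(comparator.__name__, search("ABCD", database, comparator))
```

The search loop never inspects how a score is produced, so replacing the algorithm means supplying a different function rather than changing the distribution code.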

  20. Questions/Comments
