
Meeting Presentation Sept. 12


Presentation Transcript


  1. Meeting Presentation Sept. 12. Things to do since last meeting: (1) Find out the number of drug names on the FDA website (done; the number is 6,244, which is small enough for us to do a search crawl on Twitter). (2) Read papers to find new ideas about query cost estimation. Papers read: "Predicting Query Performance"; "What Makes a Query Difficult?", by David Carmel; "Learning to Estimate Query Difficulty", SIGIR 2005 best paper; and the publications of Junghoo "John" Cho.

  2. Paper Review. "Predicting Query Performance": This is a great paper, since it introduced a new concept, the clarity score, which measures the similarity between the query model and the collection model. It helps us view query difficulty from a new perspective: query terms that are weak at distinguishing documents may lead to query difficulty. "What Makes a Query Difficult?", by David Carmel: This is a good development of the previous paper. It expands the clarity score into the higher-level concept of a "distance model": distance applies not only between query and collection, but also between query and relevant documents, between relevant documents and collection, etc. What is more, the paper adopts a more reasonable divergence function, the Jensen-Shannon divergence (JSD). (A small JSD sketch follows this slide.)
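
For concreteness, here is a minimal Python sketch of the Jensen-Shannon divergence between a query language model (estimated from top-retrieved documents) and the collection language model, in the spirit of the clarity-score idea. The unsmoothed maximum-likelihood models, the toy documents, and the function names are illustrative assumptions, not the papers' actual estimation procedure (which smooths the models and, for the original clarity score, uses KL divergence).

```python
import math
from collections import Counter

def language_model(docs):
    """Maximum-likelihood unigram model over a list of tokenized documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    m = {t: 0.5 * p.get(t, 0.0) + 0.5 * q.get(t, 0.0) for t in set(p) | set(q)}
    def kl(a):                      # KL(a || m); m covers a's support by construction
        return sum(pr * math.log2(pr / m[t]) for t, pr in a.items() if pr > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy usage: pretend the first two documents were the top results for a query.
collection = [["aspirin", "dose", "pain"], ["ibuprofen", "dose", "fever"], ["rain", "cold", "wind"]]
top_docs = collection[:2]
print(jsd(language_model(top_docs), language_model(collection)))  # larger = query model stands out more
```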

  3. Paper Review. "Learning to Estimate Query Difficulty": The paper offers a new view that sub-query coverage may also strongly affect query difficulty. To support this view, the authors provide two machine learning methods: a histogram-based estimator and a modified decision tree. The results show that a difficult query is likely to be dominated by a single sub-query. (A small sketch of the overlap features follows this slide.)
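
As a rough illustration of the sub-query idea, the sketch below measures how much each single-term sub-query's top-k results overlap with the full query's top-k results and buckets those overlaps into a histogram feature. The `retrieve` function is a hypothetical placeholder for any search backend, and the bucketing is a simplified reading of the paper; the actual estimators (histogram method and modified decision tree) are trained on features of this kind.

```python
def subquery_overlaps(retrieve, query_terms, k=10):
    """
    Overlap between the top-k documents of the full query and of each
    single-term sub-query. `retrieve(terms, k)` is a placeholder for a
    search function returning a ranked list of document ids.
    """
    full_topk = set(retrieve(query_terms, k))
    return [len(full_topk & set(retrieve([term], k))) for term in query_terms]

def overlap_histogram(overlaps, k=10, bins=5):
    """Bucket the per-sub-query overlaps into a fixed-length feature vector."""
    hist = [0] * bins
    for o in overlaps:
        hist[min(o * bins // (k + 1), bins - 1)] += 1
    return hist
```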

  4. Some Ideas. A straightforward idea from Carmel's paper is that we can delete query terms to maximize the distance between the query and the collection. The idea is not hard to implement (see the sketch below), but I wonder how much improvement we can get this way.
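
A minimal sketch of that term-deletion idea, reusing the jsd() helper from the earlier sketch and assuming a hypothetical query_model_of(terms) callback that re-runs retrieval and re-estimates the query model for a reduced query:

```python
def prune_query(query_terms, query_model_of, collection_model, min_terms=1):
    """Greedily drop the term whose removal most increases the query-collection JSD."""
    terms = list(query_terms)
    best = jsd(query_model_of(terms), collection_model)
    while len(terms) > min_terms:
        scored = [(jsd(query_model_of([t for t in terms if t != drop]), collection_model), drop)
                  for drop in terms]
        score, drop = max(scored)
        if score <= best:            # stop when no single deletion increases the distance
            break
        best = score
        terms = [t for t in terms if t != drop]
    return terms, best
```

Whether this actually improves effectiveness is exactly the open question above; the sketch only shows that the loop itself is cheap to run on top of any retrieval function.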

  5. Some Ideas. A more advanced idea is to connect term deletion with retrieval cost. The traditional retrieval cost is roughly: over the n query terms, the sum of (complexity of the scoring function * DF(i)), so the computing cost is easy to precompute. It is also interesting to consider deleting low-IDF, low-clarity terms: this would greatly reduce the computing cost, while retrieval performance might decrease, or might even increase. (A worked example follows this slide.)
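
A tiny worked example of that cost formula, with made-up document frequencies, shows why deleting a low-IDF term dominates the savings:

```python
def retrieval_cost(df, per_posting_cost=1.0):
    """Baseline cost from the slide: each query term i costs (scoring-function complexity) * DF(i)."""
    return per_posting_cost * sum(df.values())

df = {"aspirin": 12_000, "overdose": 800, "the": 950_000}          # hypothetical document frequencies
print(retrieval_cost(df))                                          # 962800.0, dominated by "the"
print(retrieval_cost({t: f for t, f in df.items() if t != "the"})) # 12800.0 after deleting it
```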

  6. Some Ideas. It is also interesting to discuss term proximity and query expansion here. In my opinion, both term proximity and external query term expansion may help improve query clarity. The additional cost of term proximity is roughly: n*(n-1)/2 * (DF1 + DF2 + averageTF1*averageTF2*comDoc), i.e., for each of the n*(n-1)/2 term pairs, walk both postings lists and compare positions in the documents the pair has in common. The additional cost of external query term expansion is roughly: n*(complexity of function*DF(i)) + k*averageDocLength + N*(complexity of function*DF(i)), where n is the number of query terms, k is the number of top documents used for expansion, and N is the number of expansion terms. It will be interesting to discuss how much clarity term proximity and external query term expansion can add. (A cost sketch follows this slide.)
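
To make the two estimates concrete, here is a small sketch that evaluates them for given statistics. The function names, the per-posting cost constant, and the choice to sum the proximity cost over explicit term pairs (equivalent to the slide's n*(n-1)/2 factor when every pair is listed) are assumptions for illustration only:

```python
def proximity_cost(pairs):
    """
    Extra cost of term-proximity scoring, per the slide: for each query-term pair,
    walk both postings lists (DF_i + DF_j) and compare positions in the documents
    the pair shares (avgTF_i * avgTF_j * common_docs).
    `pairs` holds one tuple (df_i, df_j, avg_tf_i, avg_tf_j, common_docs) per pair.
    """
    return sum(df_i + df_j + tf_i * tf_j * common
               for df_i, df_j, tf_i, tf_j, common in pairs)

def expansion_cost(query_dfs, expansion_dfs, k, avg_doc_length, per_posting_cost=1.0):
    """
    Extra cost of external query expansion, per the slide: score the original query,
    read the top-k documents to pick expansion terms, then score the N expansion terms.
    """
    return (per_posting_cost * sum(query_dfs)
            + k * avg_doc_length
            + per_posting_cost * sum(expansion_dfs))
```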
