1 / 33

Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers

Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers. Raphael Hoffmann, James Fogarty, Daniel S. Weld ...

DoraAna
Download Presentation

Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    Slide 1:Assieme: Finding and Leveraging Implicit References in a Web Search Interface for Programmers

    Raphael Hoffmann, James Fogarty, Daniel S. Weld University of Washington, Seattle UIST 2007 I am Raphael Hoffmann and this is joint work with James Fogarty and Dan Weld at the University of Washington. My talk is about Assieme – a new Web Search Interface for programmers.I am Raphael Hoffmann and this is joint work with James Fogarty and Dan Weld at the University of Washington. My talk is about Assieme – a new Web Search Interface for programmers.

    Slide 2:Programmers Use Search

    To identify an API To seek information about an API To find examples on how to use an API “Programmatically output an Acrobat PDF file in Java.” Example Task: This talk will extend this list. It is about search as performed by programmers. As we confirmed in interviews with programmers, they frequently search the Web to identify an API (that they can use in their project), to seek more information about an API (such as documentation pages), to find examples on how to use an API (many pages contain short code snippets that are very valuable to programmers) This talk will extend this list. It is about search as performed by programmers. As we confirmed in interviews with programmers, they frequently search the Web to identify an API (that they can use in their project), to seek more information about an API (such as documentation pages), to find examples on how to use an API (many pages contain short code snippets that are very valuable to programmers)

    Slide 3:Example: General Web Search Interface

    Let’s first look again at our example of outputting an acrobat PDF file in Java. We could use a general Web Search interface and search for “output acrobat”. This query is too general, so let’s add the keyword “java”. Still nothing relevant. So let’s modify the query to “output pdf java”. Ok, the first two hits seem very relevant. The first one is a long article on using an API for generating pdf output in Java. The article also contains some code snippets. It says the first step is to create a document object. However, the code sample is incomplete. It doesn’t say which package contains class Document, and we also cannot look up documentation of document, for example to choose a different constructor that better suits our needs. We could do a new search (here we added the class name to our last query – that didn’t work so well, perhaps we could try the name of the library). However, looking again carefully at our article we might also find a link to more information about the library. Navigating to another 4 pages finally brings us to the information we are looking for. In summary: A general web search engine certainly gets us the information we need, but it might take many page visits and searches. So far we have only located a small piece of information about a single API that we might use.Let’s first look again at our example of outputting an acrobat PDF file in Java. We could use a general Web Search interface and search for “output acrobat”. This query is too general, so let’s add the keyword “java”. Still nothing relevant. So let’s modify the query to “output pdf java”. Ok, the first two hits seem very relevant. The first one is a long article on using an API for generating pdf output in Java. The article also contains some code snippets. It says the first step is to create a document object. However, the code sample is incomplete. It doesn’t say which package contains class Document, and we also cannot look up documentation of document, for example to choose a different constructor that better suits our needs. We could do a new search (here we added the class name to our last query – that didn’t work so well, perhaps we could try the name of the library). However, looking again carefully at our article we might also find a link to more information about the library. Navigating to another 4 pages finally brings us to the information we are looking for. In summary: A general web search engine certainly gets us the information we need, but it might take many page visits and searches. So far we have only located a small piece of information about a single API that we might use.

    Slide 4:Example: Code-Specific Web Search Interface

    … There also exist numerous code-specific search engines on the Web now. Let’s try one of them. Here we search for “output acrobat” and we restrict the results to Java code by adding “lang:java”. We get some results, but the page summaries are confusing, so let’s click on one. A long copyright notice and a little bit of code which is totally irrelevant. Let’s again change our query to “output pdf” and restrict the results by adding “lang:java”. Again, confusing summaries, copyright notices, and code that is irrelevant. We stop at this point, but say that it is very difficult to obtain the information we need using existing code-specific search engines. It is far easier with general Web search engines, because much of the information we are looking at already exists on Web pages that have been manually crafted by humans. There also exist numerous code-specific search engines on the Web now. Let’s try one of them. Here we search for “output acrobat” and we restrict the results to Java code by adding “lang:java”. We get some results, but the page summaries are confusing, so let’s click on one. A long copyright notice and a little bit of code which is totally irrelevant. Let’s again change our query to “output pdf” and restrict the results by adding “lang:java”. Again, confusing summaries, copyright notices, and code that is irrelevant. We stop at this point, but say that it is very difficult to obtain the information we need using existing code-specific search engines. It is far easier with general Web search engines, because much of the information we are looking at already exists on Web pages that have been manually crafted by humans.

    Slide 5:Problems

    Information is dispersed: tutorials, API itself, documentation, pages with samples Difficult and time-consuming to … locate required pieces, get an overview of alternatives, judge relevance and quality of results, understand dependencies. Many page visits required Unfortunately, obtaining such information is not always straightforward: One problem is that the information is dispersed; there exist tutorials, API itself in source or binary format, documentation pages in Javadoc format, or simply pages with code examples, such as articles or messages in forums It is therefore often difficult and time-consuming to … Programmers rarely find all information on one page: they must visit many pages and perform multiple searches. Unfortunately, obtaining such information is not always straightforward: One problem is that the information is dispersed; there exist tutorials, API itself in source or binary format, documentation pages in Javadoc format, or simply pages with code examples, such as articles or messages in forums It is therefore often difficult and time-consuming to … Programmers rarely find all information on one page: they must visit many pages and perform multiple searches.

    Slide 6:With Assieme we …

    Designed a new Web search interface Developed needed inference In this talk we present Assieme – a new Web search interface designed to overcome these limitations. Assieme attemps to display all required pieces of information in a single view, and offers powerful capabilities in browsing code-related information on the Web. To do this it needs to perform some interesting inference – which I will talk about later.In this talk we present Assieme – a new Web search interface designed to overcome these limitations. Assieme attemps to display all required pieces of information in a single view, and offers powerful capabilities in browsing code-related information on the Web. To do this it needs to perform some interesting inference – which I will talk about later.

    Slide 7:Outline

    Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion How did we come up with this interface? We were initially interested in finding out about information needs of programmers. How did we come up with this interface? We were initially interested in finding out about information needs of programmers.

    Slide 8:Six Learning Barriers faced by Programmers (Ko et al. 04)

    Design barriers — What to do? Selection barriers — What to use? Coordination barriers — How to combine? Use barriers — How to use? Understanding barriers — What is wrong? Information barriers — How to check? A work that tries to answer this question is Andy Ko and Brad Myers’ paper on six learning barriers faced by programmers. There are design barriers, when programmers do not know what to do such as to conceive of an appropriate algorithm, Selection barriers, what to use, … For at least three kinds of barriers, programmers can do Web search – and these are exactly those related to APIs and those that we are addressing with Assieme.A work that tries to answer this question is Andy Ko and Brad Myers’ paper on six learning barriers faced by programmers. There are design barriers, when programmers do not know what to do such as to conceive of an appropriate algorithm, Selection barriers, what to use, … For at least three kinds of barriers, programmers can do Web search – and these are exactly those related to APIs and those that we are addressing with Assieme.

    Slide 9:Examining Programmer Web Queries

    Objective See what programmers search for Next, we wanted to find out this is consistent with queries performed by programmers on Web search engines. We used a dataset of 15 million queries and … And filtered all query sessions containing at least one query with the keyword ‘java’ … Next, we wanted to find out this is consistent with queries performed by programmers on Web search engines. We used a dataset of 15 million queries and … And filtered all query sessions containing at least one query with the keyword ‘java’ …

    Slide 10:Examining Programmer Web Queries

    These are the results we got. The sizes of the circles correspond to the relative number of queries. Indeed the largest category are API related queries (followed by troubleshooting – e.g. error messages).These are the results we got. The sizes of the circles correspond to the relative number of queries. Indeed the largest category are API related queries (followed by troubleshooting – e.g. error messages).

    Slide 11:Examining Programmer Web Queries

    Descriptive Contain package, type or member name Contain terms like “example”, “using”, “sample code” 64.1 % 35.9 % 17.9 % “java JSP current date” “java SimpleDateFormat” “using currentdate in jsp” Selection barrier Use barrier Coordination barrier Looking more closely at API-related queries, we found that 64% contained merely descriptive keywords … presumably intended to identify an appropriate API, … From the complete set of APIs, 18% contained terms like example, using,… Looking more closely at API-related queries, we found that 64% contained merely descriptive keywords … presumably intended to identify an appropriate API, … From the complete set of APIs, 18% contained terms like example, using,…

    Slide 12:Assieme

    example code documentation required libaries relevance indicated by # uses Summaries show referenced types links to related info Finally, let’s look at Assieme. We search for “output acrobat”, and the system returns some pages and also a few API packages/types/members that might be related to those keywords. We click on one which filters our set of pages to those only containing code examples using that API. All required information is now visible: Links to Javadoc, required libraries, example code, hovering, example counts tell us relevance.Finally, let’s look at Assieme. We search for “output acrobat”, and the system returns some pages and also a few API packages/types/members that might be related to those keywords. We click on one which filters our set of pages to those only containing code examples using that API. All required information is now visible: Links to Javadoc, required libraries, example code, hovering, example counts tell us relevance.

    Slide 13:Challenges

    How to put the right information on the interface ? Get all programming-related data Interpret data and infer relationships

    Slide 14:Outline

    Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion Let’s now talk about the Assieme Search Engine in more detail.Let’s now talk about the Assieme Search Engine in more detail.

    Slide 15:Assieme’s Data

    … is crawled using existing search engines Pages with code examples JAR files JavaDoc pages Queried Google on “java ±import ±class …” Queried Google on “overview-tree.html …” Downloaded library files for all projects on Sun.com, Apache.org, Java.net, SourceForge.net ~2,360,000 ~79,000 ~480,000 Finally, let me briefly talk about Assieme’s data, which is crawled using existing search engines. Separately for pages with code examples, JAR files, and JavaDoc pages.Finally, let me briefly talk about Assieme’s data, which is crawled using existing search engines. Separately for pages with code examples, JAR files, and JavaDoc pages.

    Slide 16:The Assieme Search Engine

    … infers 2 kinds of implicit references JAR files JavaDoc pages Pages with code examples Uses of packages, types and members Matches of packages, types and members ? ? To power its interface, Assieme infers two kinds of implicit references: One are uses of packages/types/members in code examples on web pages. Those packages/types/members are contained in JAR files. The other kind are packages/types/members in JAR files and their respective Javadoc documentation pages. It turns out that inferring the second kind of references is relatively easy. Javadoc pages are automatically generated from source code, so it is not difficult to parse them and to re-create the matching to the contents in JAR files. The other kind is more involved – and this will be what I will be focusing on. To find out about uses in code examples, we must first extract code examples and then resolve references.To power its interface, Assieme infers two kinds of implicit references: One are uses of packages/types/members in code examples on web pages. Those packages/types/members are contained in JAR files. The other kind are packages/types/members in JAR files and their respective Javadoc documentation pages. It turns out that inferring the second kind of references is relatively easy. Javadoc pages are automatically generated from source code, so it is not difficult to parse them and to re-create the matching to the contents in JAR files. The other kind is more involved – and this will be what I will be focusing on. To find out about uses in code examples, we must first extract code examples and then resolve references.

    Slide 17:Extracting Code Samples

    Extracting code examples from web pages is not trivial.Extracting code examples from web pages is not trivial.

    Slide 18:Extracting Code Samples

    ? ? ? remove HTML commands, but preserve line breaks remove some distracters by heuristics launch (error-tolerant) Java parser at every line break (separately parse for types, methods, and sequences of statements) <html> <head><title></title></head> <body> A simple example:<br><br> 1: import java.util.*; <br>2: class c {<br>3: HashMap m = new HashMap();<br>4: void f() { m.clear(); }<br>5: }<br><br> <a href=“index.html”>back</a> </body> </html> <html> <head><title></title></head> <body> A simple example:<br><br> 1: import java.util.*; <br>2: class c {<br>3: HashMap m = new HashMap();<br>4: void f() { m.clear(); }<br>5: }<br><br> <a href=“index.html”>back</a> </body> </html> A simple example: 1: import java.util.*; 2: class c { 3: HashMap m = new HashMap(); 4: void f() { m.clear(); } 5: } back A simple example: 1: import java.util.*; 2: class c { 3: HashMap m = new HashMap(); 4: void f() { m.clear(); } 5: } back A simple example: import java.util.*; class c { HashMap m = new HashMap(); void f() { m.clear(); } } back A simple example: import java.util.*; class c { HashMap m = new HashMap(); void f() { m.clear(); } } back Assieme extracts code examples by first removing all html commands from a page, while preserving line breaks. It then uses some heuristics to remove distracters. And finally launches a Java parser at every line break and separately attempts to parse for types, methods, and sequences of statements. The end of a code snippet is determined by tracking the state of the parser. There are more details about this in the paper.Assieme extracts code examples by first removing all html commands from a page, while preserving line breaks. It then uses some heuristics to remove distracters. And finally launches a Java parser at every line break and separately attempts to parse for types, methods, and sequences of statements. The end of a code snippet is determined by tracking the state of the parser. There are more details about this in the paper.

    Slide 19:Resolving External Code References

    Naïve approach of finding term matches does not work: 1 import java.util.*; 2 class c { 3 HashMap m = new HashMap(); 4 void f() { m.clear(); } 5 } Reference java.util.HashMap.clear() on line 4 only detectable by considering several lines ? ? Use compiler to identify unresolved names After it has extracted code examples, Assieme needs to resolve references to external APIs. A naïve approach one could try is to search for pure term matches. Unfortunately, this doesn’t work. In this small example, line 4 contains a reference to java.util.Hashmap.clear(), but this is only detectable by combining information from several lines. We therefore use a compiler to identify unresolved names.After it has extracted code examples, Assieme needs to resolve references to external APIs. A naïve approach one could try is to search for pure term matches. Unfortunately, this doesn’t work. In this small example, line 4 contains a reference to java.util.Hashmap.clear(), but this is only detectable by combining information from several lines. We therefore use a compiler to identify unresolved names.

    Slide 20:Resolving External Code References

    Index packages/types/members in Jar files Utility function: # covered references (and JAR popularity) java.util.HashMap.clear() java.util.HashMap … More specifically, we first index all packages/types/members contained in JAR files in Assieme’s data repository. Then, when we resolve external references, we first compile code snippets – which gives us a set of unresolved names. Then we do an index lookup, and put the JAR files that contain the required objects onto the classpath and attempt a re-compilation. We repeat this until we make no further progress. However, often an object with a given name is contained in many different JAR files (e.g. different versions).More specifically, we first index all packages/types/members contained in JAR files in Assieme’s data repository. Then, when we resolve external references, we first compile code snippets – which gives us a set of unresolved names. Then we do an index lookup, and put the JAR files that contain the required objects onto the classpath and attempt a re-compilation. We repeat this until we make no further progress. However, often an object with a given name is contained in many different JAR files (e.g. different versions).

    Slide 21:Scoring

    Existing techniques … Docs modeled as weighted term frequencies Hypertext link analysis (PageRank) We now discuss how Assieme makes use of the implicit references it infers. Existing techniques for scoring documents (such as modeling documents as vectors of weighted term frequencies) or differentially weighting important documents by hypertext link analysis) do not work well for code, because JAR files contain few keywords and therefore lack context Source code contains few relevant keywords Structure in code (e.g. number of uses of objects) are important for relevanceWe now discuss how Assieme makes use of the implicit references it infers. Existing techniques for scoring documents (such as modeling documents as vectors of weighted term frequencies) or differentially weighting important documents by hypertext link analysis) do not work well for code, because JAR files contain few keywords and therefore lack context Source code contains few relevant keywords Structure in code (e.g. number of uses of objects) are important for relevance

    Slide 22:Using Implicit References to Improve Scoring

    Assieme exploits structure on Web pages HTML hyperlinks Assieme tries to exploit structure on web pages (below here we see a graph of Web documents and hyperlinks between them), and structure in code (documents on the Web can be API’s or web pages with code samples, and there exist implicit references between them).Assieme tries to exploit structure on web pages (below here we see a graph of Web documents and hyperlinks between them), and structure in code (documents on the Web can be API’s or web pages with code samples, and there exist implicit references between them).

    Slide 23:Scoring

    APIs (packages/types/members) Web pages Assieme actually contains two scoring functions – one for API’s and one for Web pages.Assieme actually contains two scoring functions – one for API’s and one for Web pages.

    Slide 24:Scoring

    APIs Use text on doc pages and on pages with code samples that reference API (~ anchor text) Weight APIs by #incoming refs (~ PageRank) For scoring API’s, Assieme uses the text that appears on documentation pages and also the text on pages with code samples that use an API. This is similar to the technique of anchor text scoring (the different being that Assieme uses implicit references rather than hyperlinks). Also, Assieme weights APIs by # of incoming references (this is similar to PageRank, but again using implicit references rather than hyperlinks). For scoring Web pages, Assieme uses not only the terms on the page, but also fully qualified references with weights adjusted to their frequency. Assieme also allows to filter web pages by implicit references and it favors pages with accompanying text rather than pure code.For scoring API’s, Assieme uses the text that appears on documentation pages and also the text on pages with code samples that use an API. This is similar to the technique of anchor text scoring (the different being that Assieme uses implicit references rather than hyperlinks). Also, Assieme weights APIs by # of incoming references (this is similar to PageRank, but again using implicit references rather than hyperlinks). For scoring Web pages, Assieme uses not only the terms on the page, but also fully qualified references with weights adjusted to their frequency. Assieme also allows to filter web pages by implicit references and it favors pages with accompanying text rather than pure code.

    Slide 25:Outline

    Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion

    Slide 26:Evaluating Code Extraction and Reference Resolution

    … on 350 hand-labeled pages from Assieme’s data Code Extraction Recall 96.9%, Precision 50.1% (? 76.7%) False positives: C, C#, JavaScript, PHP, FishEye/diff (After filtering pages without refs: precision 76.7%) To evaluate Assieme, we first analyzed the effectiveness of Assieme’s inference components and then performed a user study. We hand-labeled 350 pages from Assieme’s data. For code extraction Assieme reaches a recall of 96.9% at a precision of 50.1%. While recall is important, precision is of less concern here, because Assieme later filters pages without refs which increases precision to 76.7%. For reference resolution, Assieme reaches a recall of 89.6% at a precision of 86.5%. To evaluate Assieme, we first analyzed the effectiveness of Assieme’s inference components and then performed a user study. We hand-labeled 350 pages from Assieme’s data. For code extraction Assieme reaches a recall of 96.9% at a precision of 50.1%. While recall is important, precision is of less concern here, because Assieme later filters pages without refs which increases precision to 76.7%. For reference resolution, Assieme reaches a recall of 89.6% at a precision of 86.5%.

    Slide 27:User Study

    Assieme vs. Google vs. Google Code Search Design 40 search tasks based on queries in logs: query “socket java” ? “Write a basic server that communicates using Sockets” Find code samples (and required libraries) 4 blocks of 10 tasks: 1 for training + 1 per interface In our user study, we compared Assieme to Google and Google Code Search. We developed 40 search tasks based on queries found in the query logs discussed earlier. For example, from the query for “socket java” we developed a search task “Write a basic server that communicates using Sockets”. Other tasks included loading a JPEG image, parsing an XML file. In our user study, we compared Assieme to Google and Google Code Search. We developed 40 search tasks based on queries found in the query logs discussed earlier. For example, from the query for “socket java” we developed a search task “Write a basic server that communicates using Sockets”. Other tasks included loading a JPEG image, parsing an XML file.

    Slide 28:User Study – Task Time F(1,258)=5.74 p ˜ .017 F(1,258)=1.91 p ˜ .17 * significant

    Slide 29:User Study – Solution Quality

    0 seriously flawed .5 generally good but fell short in critical regard 1 fairly complete F(1,258)=55.5 p < .0001 F(1,258)=6.29 p ˜ .013 * *

    Slide 30:User Study – # Queries Issued

    F(1,259)=9.77 p ˜ .002 F(1,259)=6.85 p ˜ .001 * *

    Slide 31:Outline

    Motivation What Programmers Search For The Assieme Search Engine Inferring Implicit References Using Implicit References for Scoring Evaluation of Inference & User Study Discussion & Conclusion

    Slide 32:Discussion & Conclusion

    Assieme – a novel web search interface Programmers obtain better solutions, using fewer queries, in the same amount of time Using Google subjects visited 3.3 pages/task, using Assieme only 0.27 pages, but 4.3 previews Ability to quickly view code samples changed participants’ strategies In this talk we presented Assieme – a novel web search interface. We showed that using Assieme, programmers obtain better solutions, using fewer queries, in the same amount of time. We expected that programmers would need fewer queries because Assieme combines much information. But it is interesting that programmers also obtained better solutions. Looking at click-through data we found that using Google subjects visited 3.3 pages/task, using Assieme only 0.27 pages, but Assieme shows previews of code snippets on a page and when we count the number of previews they saw, they actually looked at 4.3 previews per task. It thus seems that the ability to very quickly view code examples changed participant’s strategies. Using Google, they often took the first code example and prepared a solution. Using Assieme, the ease of viewing many examples, encouraged them to continue exploring to find the best one. In this talk we presented Assieme – a novel web search interface. We showed that using Assieme, programmers obtain better solutions, using fewer queries, in the same amount of time. We expected that programmers would need fewer queries because Assieme combines much information. But it is interesting that programmers also obtained better solutions. Looking at click-through data we found that using Google subjects visited 3.3 pages/task, using Assieme only 0.27 pages, but Assieme shows previews of code snippets on a page and when we count the number of previews they saw, they actually looked at 4.3 previews per task. It thus seems that the ability to very quickly view code examples changed participant’s strategies. Using Google, they often took the first code example and prepared a solution. Using Assieme, the ease of viewing many examples, encouraged them to continue exploring to find the best one.

    Slide 33:Thank You

    Raphael Hoffmann Computer Science & Engineering University of Washington raphaelh@cs.washington.edu James Fogarty Computer Science & Engineering University of Washington jfogarty@cs.washington.edu Daniel S. Weld Computer Science & Engineering University of Washington weld@cs.washington.edu This material is based upon work supported by the National Science Foundation under grant IIS-0307906, by the Office of Naval Research under grant N00014-06-1-0147, SRI International under CALO grant 03-000225 and the Washington Research Foundation / TJ Cable Professorship.

    Slide 34:Search is fundamental in modern User Interfaces

    Visualizing search results [Paek et al. 04] Finding personal information [Cutrell et al. 06] Augmenting structured sites [Huynh et al. 06] Summarizing search sessions [Dontcheva et al. 06] Invoking commands in programs [Little et al. 06] Let me start my talk by saying that search is now fundamental in modern user interface software. We have seen numerous ideas on search at UIST and CHI in recent years; Among them work on visualizing search results (for example the WaveLens project), finding personal information (the Phlat project), augmenting structure web sites with filtering and sorting capabilities on the client-side, Summarizing search sessions, Invoking keyword commands in desktop applicationsLet me start my talk by saying that search is now fundamental in modern user interface software. We have seen numerous ideas on search at UIST and CHI in recent years; Among them work on visualizing search results (for example the WaveLens project), finding personal information (the Phlat project), augmenting structure web sites with filtering and sorting capabilities on the client-side, Summarizing search sessions, Invoking keyword commands in desktop applications

    Slide 35:User Study - Feedback

More Related