
USPTO Patent Data Source and Data Extraction

Mandy Dang, MIS 580, University of Arizona, 02-06-2008.


Presentation Transcript


  1. USPTO Patent Data Source and Data Extraction Mandy Dang MIS 580 University of Arizona 02-06-2008

  2. Outline • Patent • USPTO • Search USPTO Patents • Data Extraction: Case Study of NSE Patents

  3. Patent • A "patent" usually refers to a right granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof. • A patent is not a right to practice or use the invention. Rather, it provides the right to exclude others from making, using, selling, or offering the invention for sale, usually for 20 years from the filing date. • It is a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public. • A patent is a special type of technology document that records many important innovations and technology advances.

  4. USPTO • The United States Patent and Trademark Office (USPTO) is an agency in the United States Department of Commerce that provides patent protection to inventors and businesses for their inventions, and trademark registration for product and intellectual property identification. • Each year, the USPTO issues thousands of patents to companies and individuals worldwide. As of March 2006, the USPTO had issued over 7 million patents, with 3,500 to 4,500 newly granted patents each week. • The USPTO provides online full-text access to patents issued since 1976. • URLs: • USPTO Official Website: http://www.uspto.gov/ • USPTO Patent Search: http://www.uspto.gov/main/search.html

  5. Search USPTO Patents • http://www.uspto.gov/main/search.html

  6. Data Extraction: Case Study of NSE Patents • Nanoscale Science and Engineering (NSE) field • Fundamental technology that is critical for a nation’s technological competence. • Revolutionize a wide range of application domains. • Nanotechnology • Is an applied science/ technology field that is multi-disciplinary and encompasses engineering and other work taking place at the nanoscale. • Critical for a nation’s technological competence. • R&D status attracts various communities’ interest.

  7. Data Extraction Procedure • The goal is to gather all the related patents from the USPTO Web site as free-text HTML pages, parse them into structured data, and store the result in a database. • Procedure for extracting NSE patents from the USPTO: • Spider search results (summary pages) • Spider individual patent documents (detailed pages) • Noise filtering • Parsing

  8. 1. Spider search results (summary pages) • A list of keywords can be used to search for patents related to the NSE domain. The keywords were provided by domain experts. • A spider program written in Perl was used to download the search result pages.

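  9. Example code
      use strict;
      use HTML::TokeParser;
      use LWP;
      use URI::Escape;

      sub query {
          # … … … …
          # Get keywords: one search keyword per line in the file named by $ARGV[0]
          open(my $kw_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
          my @keywords = <$kw_fh>;
          close($kw_fh);
          # … … … …
          # Download search pages: fetch result page $pno for keyword $kw and the date range
          my $query_url = "http://patft.uspto.gov/netacgi/nphParser?Sect1=PTO2&Sect2=HITOFF&p=$pno&u=%2Fnetahtml%2Fsearc-bool.html&r=0&f=S&l=50&TERM1=$kw&FIELD1=&co1=AND&TERM2=$start%3E$end&FIELD2=ISD&d=ptx";
          my $response = $browser->get($query_url);
          my $result = $response->content();
          # Print to the file handle directly so STDOUT stays the default output
          open(my $out_fh, '>', "$fpage-$pno.html") or die "Cannot write page: $!";
          print $out_fh $result;
          close($out_fh);
      }

      # Set up time range
      query('1/1/2007', '12/31/2007');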

  10. Search result page example Patent IDs

  11. 2. Spider individual patent documents (detailed pages) • In this step, we need to: • First, collect all the patent IDs; • Second, download all the patents for those patent IDs through proxies. • The data set is often very large, so using proxies can save a lot of time.

  12. Download detailed patent documents
      Create several files, each containing a fixed number of patent IDs (e.g., 300 patent IDs).

      Server: send different patent ID files to different client threads.
          … … … …
          open(my $id_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
          my @theids = <$id_fh>;
          close($id_fh);
          foreach my $theid (@theids) {
              $new_sock = $sock->accept();       # wait for the next client connection
              my $buf = <$new_sock>;             # read the client's request line
              print($new_sock $theid . "\n");    # hand the client its next patent ID
              print $buf . " " . $theid . "\n";  # log which client received which ID
              close $new_sock;
          … … … …

      Client: use a proxy to download the patents whose IDs are in the file sent from the server.
          … … … …
          do {
              $response = $browser->get($pat_url);
              if (!$response->is_success()) {
                  print STDOUT $response->status_line, "\n\n";
                  sleep(rand(7) + 1);            # pause 1 to 7 seconds before retrying
              }
          } while (!$response->is_success());
          … … … …
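The batching step on this slide (fixed-size files of patent IDs, e.g. 300 each) can be sketched in Java, the language of the parsing example later in the deck. This is only an illustration under stated assumptions: the class name `PatentIdBatcher`, the method `split`, and the placeholder IDs are hypothetical and are not taken from the original spider.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list of patent IDs into fixed-size batches,
// mirroring the "one file per 300 IDs" idea from the slide.
final class PatentIdBatcher {

    /** Returns consecutive batches holding at most batchSize IDs each. */
    static List<List<String>> split(List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            // Copy the sub-list so each batch is independent of the input list.
            batches.add(new ArrayList<>(ids.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 1; i <= 650; i++) {
            ids.add("ID" + i); // placeholder IDs, not real patent numbers
        }
        List<List<String>> batches = split(ids, 300);
        System.out.println(batches.size() + " batches");
    }
}
```

Each batch could then be written to its own file and handed to a separate client, so a failed download only requires re-running one batch.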

  13. Patent document example

  14. 3. Noise filtering • Some of the gathered patents contain only noisy variants of the NSE keywords, and some contain no NSE keywords at all. • Such patents need to be filtered out. • Noise keywords include: • nanosecond • nanoliter • nano$ • nano-second • nano-liter • nano.sub • nano [space] • nano2
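One possible way to express this filter is a single regular expression with a negative lookahead: a patent is kept only if "nano" appears at least once without being followed by one of the noise suffixes listed above. The sketch below rests on assumptions that the slide does not state: matching is case-insensitive, "nano" at the very end of the text counts as noise like "nano [space]", and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical noise filter built from the slide's list of noise keywords.
final class NoiseFilter {

    // Matches "nano" only when it is NOT followed by a noise suffix:
    // second, liter, -second, -liter, a literal "$", ".sub", whitespace,
    // the digit 2, or the end of the text.
    private static final Pattern GENUINE_NANO = Pattern.compile(
            "nano(?!-?second|-?liter|\\$|\\.sub|\\s|2|$)",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the text has at least one non-noise "nano" keyword. */
    static boolean hasGenuineKeyword(String text) {
        return GENUINE_NANO.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasGenuineKeyword("a carbon nanotube array"));
        System.out.println(hasGenuineKeyword("a 5 nanosecond laser pulse"));
    }
}
```

Patents for which `hasGenuineKeyword` returns false would be dropped before parsing. A production filter would probably restrict the check to specific fields such as title, abstract, and claims, which this sketch does not attempt.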

  15. 4. Parsing • Extract the individual data fields from the HTML patent documents and load them into the database.

  16. Parsing example: parsing inventor data
      public static void processAssignees() throws IOException {
          … … … …
          String[] assignees = assigneeString.split("<BR>");
          for (int i = 0; i < assignees.length; i++) {
              currentassignee = assignees[i].trim();
              if (currentassignee.length() == 0) continue;
              currentassignee = currentassignee.replaceAll("\r\n", "");

              // Process inventor name
              name = findBetween(currentassignee, 0, "<B>", "</B>");

              // Process inventor address
              currPosition = currentassignee.indexOf("</B>") + "</B>".length();
              address = findBetween(currentassignee, currPosition, "(", ")");
              if (address == null) {
                  System.err.println("wrong address: " + patentId);
                  continue; // skip this entry; the code below would dereference null
              }
              int startIndex = 0, endIndex = 0;
              if ((endIndex = address.lastIndexOf(',')) >= 0) {
                  city = address.substring(0, endIndex);
                  if (city.lastIndexOf(',') >= 0) {
                      city = city.substring(city.lastIndexOf(',') + 1);
                      city = city.replaceAll("[^a-zA-Z]", ""); // strings are immutable, so keep the result
                  }
                  startIndex = endIndex + 1;
              } else city = "-";
              address = address.substring(startIndex);
              country = findBetween(address, 0, "<B>", "</B>");
              if (country == null) { country = "US"; state = address.trim(); }
              else state = "-";
              name = name.trim(); city = city.trim(); state = state.trim();

              // Keep the ranking order of inventors
              rank++;
          }
      }
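The parsing code calls a helper named `findBetween`, but the slide does not show its body. The following is one plausible implementation, inferred only from how it is called (a text, a start offset, an opening delimiter, a closing delimiter, and a `null` result when nothing is found). The class name `HtmlFields` and the sample string are assumptions made for illustration.

```java
// Hypothetical reconstruction of the findBetween helper used on the slide.
final class HtmlFields {

    /**
     * Returns the text between the first occurrence of open at or after
     * index from and the next occurrence of close, or null if either
     * delimiter is missing.
     */
    static String findBetween(String text, int from, String open, String close) {
        int start = text.indexOf(open, from);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up sample in the shape the slide's code expects.
        String sample = "<B>Doe; John</B> (Tucson, AZ)";
        String name = findBetween(sample, 0, "<B>", "</B>");
        int pos = sample.indexOf("</B>") + "</B>".length();
        String address = findBetween(sample, pos, "(", ")");
        System.out.println(name + " | " + address);
    }
}
```

Returning `null` rather than an empty string matches the `address == null` and `country == null` checks in the slide's code.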

  17. Data Analysis Examples • Bibliographic analysis • Top 50 countries
      select c.countryName, count(distinct b.patentId)
      from usp_assignee a, usp_patentAssignee b, usp_countryName c
      where a.assigneeId = b.assigneeId
        and a.aCountry not in ('unknown', '')
        and a.aCountry = c.countryCode
      group by c.countryName
      order by count(distinct b.patentId) desc

  18. Citation Network Analysis Developing software: Graphviz http://www.pixelglow.com/graphviz/download/
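Graphviz reads graphs written in its DOT language, so a citation network can be handed to it as a list of directed edges. The sketch below shows how citing/cited pairs might be turned into DOT text. It is an assumption-laden illustration: the slide does not say how the citation data was exported, and the class name `CitationDot` and the placeholder IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical exporter: turns (citing, cited) pairs into a DOT digraph.
final class CitationDot {

    /** Each edge is a two-element array: {citingPatentId, citedPatentId}. */
    static String toDot(List<String[]> citations) {
        StringBuilder sb = new StringBuilder("digraph citations {\n");
        for (String[] edge : citations) {
            // Quote node names so IDs with unusual characters stay valid DOT.
            sb.append("  \"").append(edge[0]).append("\" -> \"")
              .append(edge[1]).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        List<String[]> citations = new ArrayList<>();
        citations.add(new String[] {"P1", "P2"}); // placeholder IDs
        citations.add(new String[] {"P1", "P3"});
        System.out.print(toDot(citations));
    }
}
```

Saving the output to a file such as `citations.dot` and running `dot -Tpng citations.dot -o citations.png` would render the network as an image.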

  19. Content Map Analysis Developing software: multi-level self-organizing map algorithm developed by AI Lab at the U of Arizona

  20. Thanks!
