field profiling n.
Field Profiling

bernard-peck

Presentation Transcript

  1. Field Profiling

  2. Measuring Scholarly Impact in the field of Semantic Web
     • Productivity: Top Journals, Top Researchers
     • Data: 4,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951 papers with 571,911 citations from WOS (1960-2009)

  3. Impact through citation
     • Impact: Top Journals, Top Researchers

  4. Rising Stars
     • In WOS, M. A. Harris (Gene Ontology-related research), T. Harris (design and implementation of programming languages) and L. Ding (Swoogle – Semantic Web Search Engine) are ranked as the top three authors with the highest increase in citations.
     • In Scopus, D. Roman (Semantic Web Services), J. De Bruijn (logic programming) and L. Ding (Swoogle) are ranked as the top three for the largest increase in number of citations.
     Ding, Y. (2010). Semantic Web: Who is Who in the field, Journal of Information Science, 36(3): 335-356.

  5. Week 10 Data collection

  6. S519 Steps
     • Step 1: Data collection
       • Using journals
       • Using keywords
       • Example: INFORMATION RETRIEVAL, INFORMATION STORAGE and RETRIEVAL, QUERY PROCESSING, DOCUMENT RETRIEVAL, DATA RETRIEVAL, IMAGE RETRIEVAL, TEXT RETRIEVAL, CONTENT BASED RETRIEVAL, CONTENT-BASED RETRIEVAL, DATABASE QUERY, DATABASE QUERIES, QUERY LANGUAGE, QUERY LANGUAGES, and RELEVANCE FEEDBACK.

  7. S519 Web of Science
     • Go to IU Web of Science: http://libraries.iub.edu/resources/wos
     • For example:
       • Select Core Collection
       • Search “Information Retrieval” for topics, for all years

  8. S519 Web of Science

  9. S519 Output

  10. S519 Output

  11. S519 Python Script for conversion (Conwos1.py)

      #!/usr/bin/env python
      # encoding: utf-8
      """
      conwos.py

      Convert WOS export files into tab-separated paper and reference tables.
      """

      import sys
      import os
      import re

      paper = 'paper.tsv'
      reference = 'reference.tsv'
      defsource = 'source'

      def main():
          # Ask for the folder of exported WOS files; fall back to the default.
          source = input('What is the name of source folder?\n')
          if len(source) < 1:
              source = defsource
          files = os.listdir(source)
          fpaper = open(paper, 'w')
          fref = open(reference, 'w')
          uid = 0
          for name in files:
              # Only the .txt export files are processed.
              if name[-3:] != "txt":
                  continue
              fil = open(os.path.join(source, name))
              print('%s is processing...' % name)
              first = True

  12. S519 Python Script for conversion

              for line in fil:
                  line = line[:-1]
                  if first:
                      # Skip the header row of each export file.
                      first = False
                  else:
                      uid += 1
                      record = str(uid) + "\t"
                      refs = ""
                      elements = line.split('\t')
                      for i in range(len(elements)):
                          element = elements[i]
                          if i == 1:
                              # Field 1 holds the authors: write the first five
                              # into separate columns, padding with empty ones.
                              authors = element.split('; ')
                              for j in range(5):
                                  if j < len(authors):
                                      record += authors[j] + "\t"
                                  else:
                                      record += "\t"
                          elif i == 29:
                              # Field 29 holds the cited references: write them
                              # to reference.tsv, not to the paper record.
                              refs = element
                              refz = getRefs(refs)
                              for ref in refz:
                                  fref.write(str(uid) + "\t" + ref + "\n")
                              continue
                          record += element + "\t"
                      fpaper.write(record[:-1] + "\n")
              fil.close()
          fpaper.close()
          fref.close()

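  13. S519 Python Script for conversion

      def getRefs(refs):
          # Split the cited-reference field into author, year, source,
          # volume and page columns, one record per reference.
          refz = []
          reflist = refs.split('; ')
          for ref in reflist:
              record = ""
              segs = ref.split(", ")
              author = ""
              ind = -1
              if len(segs) == 0:
                  continue
              # Everything before the four-digit year is the author name.
              for seg in segs:
                  ind += 1
                  if isYear(seg):
                      record += author[:-2] + "\t" + seg + "\t"
                      break
                  else:
                      author += seg + ", "
              ind += 1
              # The next segment is the source unless it is a volume or page.
              if ind < len(segs):
                  if not isVol(segs[ind]) and not isPage(segs[ind]):
                      record += segs[ind] + "\t"
                      ind += 1
                  else:
                      record += "\t"
              else:
                  record += "\t"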

  14. S519 Python Script for conversion

              # Volume (stored without the leading "V").
              if ind < len(segs):
                  if isVol(segs[ind]):
                      record += segs[ind][1:] + "\t"
                      ind += 1
                  else:
                      record += "\t"
              else:
                  record += "\t"
              # Page (stored without the leading "P").
              if ind < len(segs):
                  if isPage(segs[ind]):
                      record += segs[ind][1:] + "\t"
                      ind += 1
                  else:
                      record += "\t"
              else:
                  record += "\t"
              # Keep only references that have an author.
              if record[0] != "\t":
                  refz.append(record[:-1])
          return refz

  15. S519 Python Script for conversion

      def isYear(episode):
          # A year is exactly four digits, e.g. 1975.
          return re.match(r'^\d{4}$', episode) is not None

      def isVol(episode):
          # A volume is "V" followed by digits, e.g. V18.
          return re.match(r'^V\d+$', episode) is not None

      def isPage(episode):
          # A page is "P" followed by digits, e.g. P613.
          return re.match(r'^P\d+$', episode) is not None

      if __name__ == '__main__':
          main()
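
To make the reference-splitting step easier to follow, here is a minimal, self-contained sketch of the same idea applied to one cited-reference string. The function name `parse_cited_reference` and the sample string are illustrative assumptions, not part of the course script; the sketch only mirrors the author / year / source / volume / page order that the script looks for.

```python
import re

def parse_cited_reference(ref):
    """Split one cited-reference string into (author, year, source, volume, page).

    Hypothetical helper for illustration; returns None when no author or
    no four-digit year can be found.
    """
    segs = ref.split(", ")
    # Position of the first segment that is exactly four digits (the year).
    year_idx = next(
        (i for i, s in enumerate(segs) if re.fullmatch(r"\d{4}", s)), None
    )
    if year_idx is None or year_idx == 0:
        return None
    author = ", ".join(segs[:year_idx])
    year = segs[year_idx]
    rest = segs[year_idx + 1:]
    source = volume = page = ""
    # Source comes next unless the segment already looks like V<digits> or P<digits>.
    if rest and not re.fullmatch(r"[VP]\d+", rest[0]):
        source = rest.pop(0)
    if rest and re.fullmatch(r"V\d+", rest[0]):
        volume = rest.pop(0)[1:]  # drop the leading "V"
    if rest and re.fullmatch(r"P\d+", rest[0]):
        page = rest.pop(0)[1:]    # drop the leading "P"
    return (author, year, source, volume, page)

# Illustrative input in the "AUTHOR, YEAR, SOURCE, Vnn, Pnn" shape.
print(parse_cited_reference("SALTON G, 1975, COMMUN ACM, V18, P613"))
```

A string without a four-digit year, or without an author before the year, yields `None`, which corresponds to the references the script drops.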

  16. S519 Convert output to database • Using python script: conwos1.py • Output: paper.tsv, reference.tsv

  17. S519 Convert output to database • Paper.tsv

  18. S519 Convert output to database • Reference.tsv

  19. S519 Load them to Access • Import data from external data at Access

  20. S519 Access Tables • Paper table

  21. S519 Access Tables • Citation table

  22. S519 Productivity & impact

  23. S519 Productivity • Top Authors • Find duplicate records (Query template)

  24. S519 Productivity • Top Journals • Find duplicate records (Query template)

  25. S519 Productivity • Top Organizations • Find duplicate records (Query template)
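
The "find duplicate records" queries above amount to counting how often each value occurs. As a rough Python equivalent for the Top Authors case, the sketch below counts author names in a tab-separated table. The function name `top_authors`, the column positions and the sample rows are assumptions made for illustration; the real column positions in paper.tsv should be checked before reusing it.

```python
import csv
import io
from collections import Counter

def top_authors(tsv_file, author_columns=(1, 2, 3, 4, 5), n=10):
    """Count how many papers each author appears on.

    author_columns is an assumed layout (id in column 0, then up to
    five author columns); adjust it to the actual table.
    """
    counts = Counter()
    for row in csv.reader(tsv_file, delimiter="\t"):
        for col in author_columns:
            if col < len(row) and row[col].strip():
                counts[row[col].strip()] += 1
    return counts.most_common(n)

# Made-up rows: id followed by five author columns.
sample = io.StringIO(
    "1\tSmith J\tLee K\t\t\t\n"
    "2\tLee K\t\t\t\t\n"
    "3\tLee K\tSmith J\tChen M\t\t\n"
)
print(top_authors(sample, n=2))
```

The same counting pattern works for journals or organizations by pointing `author_columns` at the relevant column.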

  26. S519 Impact • Highly cited authors • Find duplicate records (Query template)

  27. S519 Impact • Highly cited journals • Find duplicate records (Query template)

  28. S519 Impact • Highly cited articles • Find duplicate records (Query template)
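
The three impact rankings can be derived in one pass over the citation table. The following is a minimal sketch that assumes each row has the shape (citing paper id, cited author, year, source, volume, page); the function name `citation_counts` and the sample rows are made up for illustration.

```python
from collections import Counter

def citation_counts(reference_rows):
    """Tally citations per cited author, cited journal and cited article."""
    by_author, by_journal, by_article = Counter(), Counter(), Counter()
    for citing_id, author, year, source, volume, page in reference_rows:
        by_author[author] += 1
        if source:
            by_journal[source] += 1
        # An article is identified here by all of its reference fields.
        by_article[(author, year, source, volume, page)] += 1
    return by_author, by_journal, by_article

# Made-up citation rows, not real data.
rows = [
    (1, "AUTHOR A", "1990", "JOURNAL X", "1", "10"),
    (2, "AUTHOR A", "1990", "JOURNAL X", "1", "10"),
    (2, "AUTHOR B", "1995", "JOURNAL Y", "2", "20"),
    (3, "AUTHOR A", "2000", "JOURNAL Y", "5", "50"),
]
authors, journals, articles = citation_counts(rows)
print(authors.most_common(1))   # most cited author
print(articles.most_common(1))  # most cited article
```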

  29. S519 Other indicators
      • What are other indicators to measure productivity and impact?
        • Time
        • Journal impact factor
        • Journal category
        • Keyword
        • …
      • Think in depth: what are your new indicators?

  30. S519 Week 11 Author-cocitation network

  31. S519 Top 100 highly cited authors • First select the set of authors with whom you want to build up the matrix • Select top 100 highly cited authors
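
Once the set of authors is chosen, the co-citation matrix counts, for every pair of authors, how many papers cite both. The sketch below shows one way to compute it; the function name `cocitation_matrix`, the (paper id, cited author) input shape and the toy data are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cocitation_matrix(citations, authors):
    """Build a symmetric author co-citation matrix.

    citations: iterable of (citing_paper_id, cited_author) pairs.
    authors:   the selected authors, in the row/column order wanted.
    """
    selected = set(authors)
    cited_by_paper = defaultdict(set)
    for paper_id, cited_author in citations:
        if cited_author in selected:
            cited_by_paper[paper_id].add(cited_author)

    index = {a: i for i, a in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    # Each paper contributes 1 to every pair of selected authors it cites.
    for cited in cited_by_paper.values():
        for a, b in combinations(sorted(cited), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return matrix

# Toy data: paper 1 cites A and B; paper 2 cites A, B and C; paper 3 cites C.
pairs = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (2, "C"), (3, "C")]
for row in cocitation_matrix(pairs, ["A", "B", "C"]):
    print(row)
```

The diagonal is left at zero in this sketch; how to fill it is a separate modelling choice.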

  32. S519 Author Cocitation Network

  33. S519 Author Cocitation Network

  34. S519 Author Cocitation Network

  35. S519 Load the network to SPSS

  36. S519 Load the network to SPSS

  37. Clustering

  38. S519 Clustering Analysis
      • Aim: create clusters of items that are similar to others in the same cluster and different from those outside the cluster.
      • In other words, maximize similarity within clusters and difference between clusters.
      • Items are called cases in SPSS.
      • There are no dependent variables in cluster analysis.

  39. S519 Clustering Analysis
      • The degree of similarity and dissimilarity is measured by the distance between cases.
      • Euclidean distance measures the length of a straight line between two cases.
      • The numeric values used to compute distance should be on the same measurement scale.
      • If they are on different measurement scales:
        • Transform them to the same scale
        • Or create a distance matrix first
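
As a small worked illustration of the distance idea, the sketch below computes the Euclidean distance between cases and collects the results in a distance matrix. The function names and the three sample cases are assumptions chosen so that the arithmetic is easy to check by hand.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two cases with the same variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(cases):
    """Pairwise Euclidean distances between all cases."""
    return [[euclidean(a, b) for b in cases] for a in cases]

# Three made-up cases measured on two variables.
cases = [[0, 0], [3, 4], [6, 8]]
for row in distance_matrix(cases):
    print(row)
```

If one variable had a much larger numeric range than the other, it would dominate these distances, which is why the variables are brought to a common scale first.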

  40. S519 Clustering
      • Hierarchical clustering does not require deciding the number of clusters in advance; good for a small set of cases.
      • K-means does require the number of clusters in advance; good for a large set of cases.

  41. S519 Hierarchical Clustering

  42. S519 Hierarchical Clustering

  43. Hierarchical Clustering: Data
      • Data
        • The variables can be quantitative, binary, or count data.
        • Scaling of variables is an important issue; differences in scaling may affect your cluster solution(s).
        • If your variables have large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).

  44. Hierarchical Clustering: Data
      • Case order
        • The cluster solution may depend on the order of cases in the file.
        • You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution.

  45. Hierarchical Clustering: Data
      • Assumptions
        • The distance or similarity measures used should be appropriate for the data analyzed.
        • You should include all relevant variables in your analysis.
        • Omission of influential variables can result in a misleading solution.
      Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.

  46. Hierarchical Clustering: Method
      • Nearest neighbor or single linkage
        • The dissimilarity between clusters A and B is represented by the minimum of all possible distances between cases in A and B.
      • Furthest neighbor or complete linkage
        • The dissimilarity between clusters A and B is represented by the maximum of all possible distances between cases in A and B.
      • Between-groups linkage or average linkage
        • The dissimilarity between clusters A and B is represented by the average of all possible distances between cases in A and B.
      • Within-groups linkage
        • The dissimilarity between clusters A and B is represented by the average of all possible distances between the cases within the single new cluster formed by combining A and B.

  47. Hierarchical Clustering: Method
      • Centroid clustering
        • The dissimilarity between clusters A and B is represented by the distance between the centroid of the cases in cluster A and the centroid of the cases in cluster B.
      • Ward’s method
        • The dissimilarity between clusters A and B is represented by the “loss of information” from joining the two clusters, measured by the increase in the error sum of squares.
      • Median clustering
        • The dissimilarity between clusters A and B is represented by the distance between the SPSS-determined median for the cases in cluster A and the median for the cases in cluster B.
      All three methods should use squared Euclidean distance rather than Euclidean distance.
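
The linkage definitions on the last two slides can be made concrete with a few lines of Python. This is a minimal sketch under stated assumptions: the function names and the two small clusters are illustrative, only single, complete, average, centroid and Ward are covered, and it does not claim to reproduce SPSS output.

```python
import math
from itertools import product

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Mean of each variable over the cases in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def single_linkage(A, B):
    return min(euclidean(a, b) for a, b in product(A, B))

def complete_linkage(A, B):
    return max(euclidean(a, b) for a, b in product(A, B))

def average_linkage(A, B):
    return sum(euclidean(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_linkage(A, B):
    # Uses squared Euclidean distance, as the slide recommends.
    return squared_euclidean(centroid(A), centroid(B))

def error_sum_of_squares(cluster):
    c = centroid(cluster)
    return sum(squared_euclidean(case, c) for case in cluster)

def ward_increase(A, B):
    """Increase in error sum of squares caused by merging A and B."""
    return (error_sum_of_squares(A + B)
            - error_sum_of_squares(A)
            - error_sum_of_squares(B))

# Two made-up clusters of two cases each.
A = [[0, 0], [0, 2]]
B = [[3, 0], [3, 2]]
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))
print(centroid_linkage(A, B), ward_increase(A, B))
```

At each step of hierarchical clustering, the pair of clusters with the smallest value of the chosen measure is merged.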

  48. Measure for Interval
      • Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.
      • Squared Euclidean distance. The sum of the squared differences between the values for the items.
      • Pearson correlation. The product-moment correlation between two vectors of values.
      • Cosine. The cosine of the angle between two vectors of values.
      • Chebychev. The maximum absolute difference between the values for the items.
      • Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance.
      • Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.
      • Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
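
The verbal definitions above translate directly into short functions. The sketch below is illustrative only: the function names and the two sample vectors are assumptions, and the code follows the definitions as worded on the slide without claiming to match any particular SPSS implementation detail.

```python
import math

def block(a, b):
    """Sum of absolute differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebychev(a, b):
    """Maximum absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def customized(a, b, p, r):
    """rth root of the sum of absolute differences to the pth power."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / r)

def minkowski(a, b, p):
    """Special case of customized with r == p; p = 2 gives Euclidean."""
    return customized(a, b, p, p)

def cosine(a, b):
    """Cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Product-moment correlation between the two vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up vectors; b is exactly twice a.
a, b = [1, 2, 3], [2, 4, 6]
print(block(a, b), chebychev(a, b), minkowski(a, b, 2))
print(cosine(a, b), pearson(a, b))
```

Note that Pearson correlation and cosine are similarity measures (larger means more alike), while the others are distances (larger means less alike).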

  49. Transform values
      • Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.
      • Range -1 to 1. Each value for the item being standardized is divided by the range of the values.
      • Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.
      • Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.
      • Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.
      • Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.
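
For readers who want to see the arithmetic, the transformations listed above can be sketched as follows. The function names and sample values are illustrative assumptions, and the use of the sample standard deviation (`statistics.stdev`, dividing by n - 1) is also an assumption rather than something stated on the slide.

```python
import statistics

def z_scores(values):
    """Subtract the mean, divide by the (sample) standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def range_0_to_1(values):
    """Subtract the minimum, then divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def divide_by_range(values):
    """The 'Range -1 to 1' option as worded: divide by the range."""
    span = max(values) - min(values)
    return [v / span for v in values]

def max_magnitude_1(values):
    """Divide by the maximum of the values."""
    top = max(values)
    return [v / top for v in values]

def mean_1(values):
    """Divide by the mean of the values."""
    mean = statistics.mean(values)
    return [v / mean for v in values]

def sd_1(values):
    """Divide by the (sample) standard deviation."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]

# Made-up values for one variable.
values = [2, 4, 6]
print(z_scores(values))
print(range_0_to_1(values))
print(mean_1(values))
```

These helpers assume the values are not all identical; otherwise the range or standard deviation is zero and the division fails.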

  50. S519 Hierarchical Clustering: Method