
Scholarly Impact in Semantic Web Data: Analysis of Top Journals and Researchers

This study analyzes scholarly impact in the field of Semantic Web data through citation analysis, focusing on top journals and researchers. The analysis draws on data from Scopus (1975-2009) and Web of Science (1960-2009).


Presentation Transcript


  1. Field Profiling

  2. Measuring Scholarly Impact in the field of Semantic Web Data • Productivity • Top Journals • Top Researchers • Corpus: 44,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951 papers with 571,911 citations from WOS (1960-2009)

  3. Impact through Citations • Top Journals • Top Researchers

  4. Rising Stars • In WOS, M. A. Harris (Gene Ontology-related research), T. Harris (design and implementation of programming languages), and L. Ding (Swoogle, a Semantic Web search engine) rank as the top three authors with the highest increase in citations. • In Scopus, D. Roman (Semantic Web Services), J. De Bruijn (logic programming), and L. Ding (Swoogle) rank as the top three for the largest increase in citations. Ding, Y. (2010). Semantic Web: Who is who in the field. Journal of Information Science, 36(3): 335-356.

  5. Section 1: Data Collection

  6. Steps • Step 1: Data collection, either by journals or by keywords • Example keywords: INFORMATION RETRIEVAL, INFORMATION STORAGE and RETRIEVAL, QUERY PROCESSING, DOCUMENT RETRIEVAL, DATA RETRIEVAL, IMAGE RETRIEVAL, TEXT RETRIEVAL, CONTENT BASED RETRIEVAL, CONTENT-BASED RETRIEVAL, DATABASE QUERY, DATABASE QUERIES, QUERY LANGUAGE, QUERY LANGUAGES, and RELEVANCE FEEDBACK.

  7. Web of Science • Go to the IU Web of Science portal: http://libraries.iub.edu/resources/wos • For example: select the Core Collection and search "Information Retrieval" as a topic, for all years
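As a sketch (not from the original slides), the example keywords from Step 1 can be combined into a single Web of Science advanced-search topic query; TS= is the standard WoS topic field tag, and the keyword list here is just a sample:

keywords = ["INFORMATION RETRIEVAL", "QUERY PROCESSING", "DOCUMENT RETRIEVAL",
            "CONTENT-BASED RETRIEVAL", "RELEVANCE FEEDBACK"]
# Build a TS=(...) topic query that ORs the keywords together
query = "TS=(" + " OR ".join('"%s"' % k for k in keywords) + ")"
print(query)  # TS=("INFORMATION RETRIEVAL" OR "QUERY PROCESSING" OR ...)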

  8. Web of Science

  9. Output

  10. Output

  11. Python • Download Python: https://www.python.org/downloads/ • To run Python smoothly from the command line, you may need to change certain environment settings in Windows. • In short, the path is: My Computer ‣ Properties ‣ Advanced ‣ Environment Variables • In this dialog you can add or modify User and System variables; changing System variables requires unrestricted access to the machine (i.e., Administrator rights). • User variable: C:\Program Files (x86)\Python27\Lib; • Alternatively, inspect the settings from the command line with "set" and "echo %path%"
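A quick sanity check, offered as a sketch rather than part of the original slides: confirm which interpreter runs and that its folder is on the PATH.

import os
import sys

print(sys.version)                  # e.g. 2.7.x for the Python27 install above
print(sys.executable)               # full path of the running interpreter
print(os.environ.get('PATH', ''))   # should include the Python install folder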

  12. Python Script for conversion (conwos1.py)

#!/usr/bin/env python
# encoding: utf-8
"""
conwos.py -- convert a WOS export file into tab-separated format,
writing papers to paper.tsv and cited references to reference.tsv.
"""
import os
import re

paper = 'paper.tsv'
reference = 'reference.tsv'
defsource = 'source'

def main():
    source = raw_input('What is the name of the source folder?\n')
    if len(source) < 1:
        source = defsource
    files = os.listdir(source)
    fpaper = open(paper, 'w')
    fref = open(reference, 'w')
    uid = 0
    for name in files:
        if name[-3:] != "txt":      # process only the .txt exports
            continue
        fil = open(os.path.join(source, name))
        print '%s is processing...' % name
        first = True

  13. Python Script for conversion (continued)

        for line in fil:
            line = line[:-1]                    # strip the trailing newline
            if first == True:                   # skip each file's header line
                first = False
            else:
                uid += 1
                record = str(uid) + "\t"
                refs = ""
                elements = line.split('\t')
                for i in range(len(elements)):
                    element = elements[i]
                    if i == 1:                  # author field: first 5 authors
                        authors = element.split('; ')
                        for j in range(5):
                            if j < len(authors):
                                record += authors[j] + "\t"
                            else:
                                record += "\t"
                    elif i == 29:               # cited-references field
                        refs = element
                        refz = getRefs(refs)
                        for ref in refz:
                            fref.write(str(uid) + "\t" + ref + "\n")
                        continue                # refs go to reference.tsv only
                    record += element + "\t"
                fpaper.write(record[:-1] + "\n")
        fil.close()
    fpaper.close()
    fref.close()

  14. Python Script for conversion (continued)

def getRefs(refs):
    """Parse a WOS cited-reference string into author, year, source,
    volume, and page fields, tab-separated."""
    refz = []
    reflist = refs.split('; ')
    for ref in reflist:
        record = ""
        segs = ref.split(", ")
        author = ""
        ind = -1
        if len(segs) == 0:
            continue
        for seg in segs:
            ind += 1
            if isYear(seg):                 # author names run until the year
                record += author[:-2] + "\t" + seg + "\t"
                break
            else:
                author += seg + ", "
        ind += 1
        if ind < len(segs):                 # source title (unless vol/page)
            if not isVol(segs[ind]) and not isPage(segs[ind]):
                record += segs[ind] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"

  15. Python Script for conversion (continued)

        if ind < len(segs):                 # volume marker, e.g. V36
            if isVol(segs[ind]):
                record += segs[ind][1:] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if ind < len(segs):                 # page marker, e.g. P335
            if isPage(segs[ind]):
                record += segs[ind][1:] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if record[0] != "\t":               # keep only refs with an author
            refz.append(record[:-1])
    return refz

  16. Python Script for conversion (continued)

def isYear(episode):
    """True if the segment is a four-digit year."""
    return re.search(r'^\d{4}$', episode) is not None

def isVol(episode):
    """True if the segment is a volume marker such as V36."""
    return re.search(r'^V\d+$', episode) is not None

def isPage(episode):
    """True if the segment is a page marker such as P335."""
    return re.search(r'^P\d+$', episode) is not None

if __name__ == '__main__':
    main()

  17. Convert output to database • Run the Python script conwos1.py • Output: paper.tsv, reference.tsv
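A small sketch for sanity-checking the converted output (file name and tab-separated layout follow conwos1.py above; the column meanings are whatever the WOS export contained):

with open('paper.tsv') as f:
    for _ in range(3):                          # peek at the first three records
        fields = f.readline().rstrip('\n').split('\t')
        print(fields[:6])                       # uid plus the first author columns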

  18. Convert output to database • Paper.tsv

  19. Convert output to database • Reference.tsv

  20. Load them into Access • Import the files through Access's External Data import
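For a scripted load instead of the GUI wizard, a hedged pyodbc sketch; it assumes the Microsoft Access ODBC driver is installed and that a Paper table with matching columns already exists (the database path, table, and column names are hypothetical):

import csv
import pyodbc

conn = pyodbc.connect(
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
    r'DBQ=C:\data\wos.accdb;')                  # hypothetical database path
cur = conn.cursor()
with open('paper.tsv') as f:
    for row in csv.reader(f, delimiter='\t'):
        cur.execute('INSERT INTO Paper (UID, Author1) VALUES (?, ?)',
                    row[0], row[1])             # hypothetical table/columns
conn.commit()
conn.close()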

  21. Access Tables • Paper table

  22. Access Tables • Citation table

  23. Section 2: Productivity & Impact

  24. Productivity • Top Authors • Find duplicate records (Query template)

  25. Productivity • Top Journals • Find duplicate records (Query template)

  26. Productivity • Top Organizations • Find duplicate records (Query template)
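The idea behind the "find duplicate records" query template on the three productivity slides, written out as SQL inside Python; the Paper table and Author1 column are assumptions carried over from the import step:

TOP_AUTHORS_SQL = """
SELECT Author1, COUNT(*) AS NumPapers
FROM Paper
GROUP BY Author1
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
"""
# Swap Author1 for the journal or organization column to rank top journals
# or top organizations in the same way.
# cur.execute(TOP_AUTHORS_SQL)   # run on the pyodbc connection sketched earlier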

  27. Impact • Highly cited authors • Find duplicate records (Query template)

  28. Impact • Highly cited journals • Find duplicate records (Query template)

  29. Impact • Highly cited articles • Find duplicate records (Query template)
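The impact counterpart, again as a sketch with assumed names: group the Citation table built from reference.tsv and count how often each cited author appears:

HIGHLY_CITED_SQL = """
SELECT CitedAuthor, COUNT(*) AS TimesCited
FROM Citation
GROUP BY CitedAuthor
ORDER BY COUNT(*) DESC;
"""
# Group by the cited journal column, or by author + year + volume + page,
# to rank highly cited journals or articles instead.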

  30. Other indicators • What other indicators could measure productivity and impact? • Time • Journal impact factor • Journal category • Keyword • … Think about this in depth: what new indicators would you propose?

  31. Section 3: Author Cocitation Network

  32. Top 100 highly cited authors • First, select the set of authors for which to build the matrix • Select the top 100 highly cited authors
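A minimal sketch of the counting behind the cocitation matrix: two authors are co-cited whenever the same citing paper (uid) references both. The file layout follows reference.tsv above; filtering to the top 100 authors is left as a separate step.

from collections import defaultdict
from itertools import combinations

cited_by_paper = defaultdict(set)
with open('reference.tsv') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        uid, author = fields[0], fields[1]
        cited_by_paper[uid].add(author)

cocitation = defaultdict(int)
for authors in cited_by_paper.values():
    for a, b in combinations(sorted(authors), 2):
        cocitation[(a, b)] += 1                 # one symmetric cell per pair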

  33. Author Cocitation Network

  34. Author Cocitation Network

  35. Author Cocitation Network

  36. Load the network into SPSS

  37. Load the network into SPSS

  38. Section 4: Clustering

  39. Clustering Analysis • Aim: group items into clusters so that items in the same cluster are similar to one another and different from items outside it • In other words, maximize within-cluster similarity and between-cluster difference • Items are called cases in SPSS • There are no dependent variables in cluster analysis

  40. Clustering Analysis • The degree of similarity or dissimilarity is measured by the distance between cases • Euclidean distance measures the length of a straight line between two cases • Distance values are only meaningful when all variables share the same measurement scale • If variables are on different measurement scales, either transform them to a common scale or create a distance matrix first
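A toy illustration of the scaling point (values are made up): a variable measured on a large scale dominates the Euclidean distance.

import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# case = (salary in dollars, experience in years)
a = (50000.0, 2.0)
b = (51000.0, 30.0)
print(euclid(a, b))   # about 1000.4: the dollar scale swamps a 28-year gap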

  41. Clustering • Hierarchical clustering does not require choosing the number of clusters in advance; it is well suited to small sets of cases • K-means requires the number of clusters up front; it is well suited to large sets of cases (see the sketch below)
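A hedged k-means illustration with SciPy (the slides use SPSS; scipy.cluster.vq.kmeans2 shows the same "choose k first" requirement; the data are random, purely for demonstration):

import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(42)
X = np.random.rand(200, 4)              # 200 cases, 4 variables
centroids, labels = kmeans2(X, 3)       # k = 3 must be chosen up front
print(labels[:10])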

  42. Hierarchical Clustering

  43. Hierarchical Clustering

  44. Hierarchical Clustering: Data • Data. • The variables can be quantitative, binary, or count data. • Scaling of variables is an important issue: differences in scaling may affect your cluster solution(s). • If your variables have large differences in scaling (for example, one variable is measured in dollars and the other in years), you should consider standardizing them (the Hierarchical Cluster Analysis procedure can do this automatically).

  45. Hierarchical Clustering: Data • Case order. • The cluster solution may depend on the order of cases in the file. • You may want to obtain several solutions with cases sorted in different random orders to verify the stability of a given solution.

  46. Hierarchical Clustering: Data • Assumptions. • The distance or similarity measures used should be appropriate for the data analyzed. • Also, you should include all relevant variables in your analysis. • Omission of influential variables can result in a misleading solution. Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.

  47. Hierarchical Clustering: Method • Nearest neighbor (single linkage) • The dissimilarity between clusters A and B is the minimum of all possible distances between cases in A and B • Furthest neighbor (complete linkage) • The dissimilarity between clusters A and B is the maximum of all possible distances between cases in A and B • Between-groups (average) linkage • The dissimilarity between clusters A and B is the average of all possible distances between cases in A and B • Within-groups linkage • The dissimilarity between clusters A and B is the average of all possible distances between cases within the single new cluster formed by combining A and B

  48. Hierarchical Clustering: Method • Centroid clustering • The dissimilarity between clusters A and B is the distance between the centroid of the cases in cluster A and the centroid of the cases in cluster B • Ward's method • The dissimilarity between clusters A and B is the "loss of information" from joining the two clusters, measured by the increase in the error sum of squares • Median clustering • The dissimilarity between clusters A and B is the distance between the SPSS-determined median of the cases in cluster A and the median of the cases in cluster B • These three methods (centroid, Ward's, and median) should use squared Euclidean distance rather than Euclidean distance
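These linkage choices can be illustrated with SciPy instead of SPSS (a sketch; the method names map directly to 'single', 'complete', 'average', 'centroid', 'median', and 'ward'; the data are random):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

np.random.seed(0)
X = np.random.rand(10, 4)               # 10 cases, 4 variables
Z = linkage(X, method='ward')           # centroid/median/ward expect raw
                                        # Euclidean observations (cf. the
                                        # squared-distance note above)
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
print(labels)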

  49. Measure for Interval • Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data. • Squared Euclidean distance. The sum of the squared differences between the values for the items. • Pearson correlation. The product-moment correlation between two vectors of values. • Cosine. The cosine of the angle between two vectors of values. • Chebychev. The maximum absolute difference between the values for the items. • Block. The sum of the absolute differences between the values for the items. Also known as Manhattan distance. • Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items. • Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
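The interval measures above have direct SciPy equivalents, sketched here; note that SciPy's correlation and cosine functions return dissimilarities (1 minus the value SPSS reports), and the customized r/p measure has no single SciPy call:

from scipy.spatial import distance

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(distance.euclidean(a, b))         # Euclidean
print(distance.sqeuclidean(a, b))       # squared Euclidean
print(distance.correlation(a, b))       # 1 - Pearson correlation
print(distance.cosine(a, b))            # 1 - cosine of the angle
print(distance.chebyshev(a, b))         # Chebychev
print(distance.cityblock(a, b))         # block / Manhattan
print(distance.minkowski(a, b, p=3))    # Minkowski with p = 3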

  50. Transform values • Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1. • Range -1 to 1. Each value for the item being standardized is divided by the range of the values. • Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range. • Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values. • Mean of 1. The procedure divides each value for the item being standardized by the mean of the values. • Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.
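The same transformations in NumPy, as an illustrative sketch (SPSS standardizes with the sample standard deviation, hence ddof=1):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
z     = (x - x.mean()) / x.std(ddof=1)      # z scores: mean 0, sd 1
r11   = x / (x.max() - x.min())             # range -1 to 1: divide by the range
r01   = (x - x.min()) / (x.max() - x.min()) # range 0 to 1
m1    = x / np.abs(x).max()                 # maximum magnitude of 1
mean1 = x / x.mean()                        # mean of 1
sd1   = x / x.std(ddof=1)                   # standard deviation of 1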
