470 likes | 744 Views
Network Analysis Class Data cleaning-name disambiguation User Guide. Office of Portfolio Analysis Division of Program Coordination, Planning, and Strategic Initiatives National Institutes of Health. Network Analysis Class/Data cleaning-Name disambiguation.
E N D
Network Analysis Class Data cleaning-name disambiguation User Guide Office of Portfolio Analysis Division of Program Coordination, Planning, and Strategic Initiatives National Institutes of Health Network Analysis Class/Data cleaning-Name disambiguation
Data cleaning-name disambiguation has three steps All three steps involved in name disambiguation are needed to create a valid and meaningful co-author network New! Step 1: Automated assistance using the iSearch-Publications co-occurrence feature Step 2: Manual check of merged names Step 3: Manual check of unmerged names Network Analysis Class/Data cleaning-Name disambiguation
Data cleaning-name disambiguationStep 1: Automated assistance using the iSearch-Publications co-occurrence feature Network Analysis Class/Data cleaning-Name disambiguation
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Open iSearchPublications module https://itools.od.nih.gov/dashboard/ • If you already know your list of PMIDs, just search for them in iSearch 2 1 1- Copy and paste the list of PMIDs to the search box 2- Select “OR” 3- Select the “PMID” as the field to search 4- Hit the search button (Magnifying glass icon) 4 3 Field to search Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 • If you don’t know your portfolio of publications, proceed to do the search for publications as usual • Once you have your publications in iSearch, go to the export button and select “Co-occurrence Graph” Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 • A pop up window (below) will open Select the field to use to generate the Co-occurrence Graph. For co-author networks it is the authors field. 1 It is the minimum number of times values have to co-occur to appear it in the network. For co-author networks it is the minimum number of times authors have to publish together to generate an edge. 2 3 It is the number of values for a field in a document before the values are truncated to the first N and the last value. For example if authors is selected and maximum values is set to 50 then only the first 50 authors and the last author are used in the co-occurrence graph generation. 4 give a title to the file (optional) Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Select the field to use to generate the Co-occurrence Graph For co-author networks select the authors field 1 2 Minimum Edge Weight It is the minimum number of times values have to co-occur to appear it in the network. For co-author networks it is the minimum number of times authors have to publish together to generate an edge. The default number is 2. Even if we move the slider to 1 (which we recommend), this setting implies that isolated nodes are not going to show in the edges file that this feature produces. Arranges need to be made to add isolates in case they exist and the edges’ file is used. Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 • Maximum values per document • It is the number of values for a field in a document before the values are truncated to the first N and the last value. For example if authors is selected and maximum values is set to 50 then only the first 50 authors and the last author are used in the co-occurrence graph generation. • To search for the maximum number of author in your portfolio there are two options: • Option 1: Download the data and check for the max number of authors using OPA Training Excel Tip 11 • Export the publications to an excel file (including the authors field) and follow the excel tip 11 on OPA website https://dpcpsi.nih.gov/eo/intranet/opa/training/excel_tips 3 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Maximum values per document • Option 2: Create a “Portfolio” in iSearch • Create a portfolio for the publications • Name the portfolio and click “Create” In this example this publication portfolio will have 7 PMIDs Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Maximum values per document • Option 2: Create a “Portfolio” in iSearch • When asked if you want to create a portfolio without adding curators, simply select “Yes” • Then open the portfolio clicking on the “Portfolios” icon and select the portfolio Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Maximum values per document • Option 2: Create a “Portfolio” in iSearch • Once the portfolio is opened, use the following search query to identify how many of your publications have for example more than 1, but less than 5 authors: authorCount:[1 TO 5] • None of the 7 publications of our example have less than 5 authors
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Maximum values per document • Option 2: Create a “Portfolio” in iSearch • Repeating the query for more than 1 but less than 10 authors: authorCount:[1 TO 10] allows us to see that all our publications have 10 or less authors
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Maximun values per document Once we know what is the maximum number of authors in our publications, we can set the maximum value per document for example to 11 (this way all our pubs will be included with their total amount of authors in the co-occurrence graph) 3 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 Then name the file (if desired) and click on “Generate” 4 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 • There are three output files under iSearch notifications: • Node file • Edge file • Docs file Note that the extension of the files is .TSV. To open them in Excel, download the files and then look for them in the “Downloads” folder of your computer iSearch notifications icon Output files Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 To open the .TSV files in Excel, open a new Excel spreadsheet and click “open”. When prompted with the “Text Import Wizard” window, we should select the Unicode UTF-8 option (pic below). • Output files: • Nodes file • This file has information about each node. The file has the following columns: • Best Name: Name for the node • Merged Names: Other names for this node • PMIDs: The Pubmedids this node occurred on • Grants: The grants linked to the publications this node occurred on • Clinical Trials; patents; journals; earliest pub year; latest pub year this node occurred on Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 • Output files: • Edges file • This file shows the relationship between the entities. The file has the following columns: • Source and target nodes column: In the example of co-author networks, the source and target nodes are interchangeable since co-author networks are undirected. • Weight: The number of times they co-occur is the weight column • PMIDs: the PMIDs where the source and target nodes co-occur in are listed in this column These two authors only published together one time, in PMID 21642420 therefore their edge’s weight is 1 These two authors only published together three times, in PMIDs 20096814, 20974641, and 21642420 therefore their edge’s weight is 3 Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 1 Automated assistance using the iSearch-Publications co-occurrence feature Step 1 • Output files: • Docs file • This file has information about each node. The file has the following columns: • PMID • Original names (as they appear in the publications) • Clustered names, which are the already disambiguated names Author “Lui, Julian” had two different denominations in the input file (original names column) and only one in the output file (clustered names column) Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguationStep 2: Manual check of merged names Network Analysis Class
Data cleaning-name disambiguation: Step 2 Manual check of merged names Step 2 • Output files: • The nodes file has a column called “merged names” • The Merged Names column shows you which names have been merged.If any name was merged but shouldn’t have, it has to be manually corrected in the Docs file’ “clustered names” column and the nodes and edges files provided by iSearch can no longer be used (they will not include any manual corrections). Instructions on how to create the nodes and edges files can be found at https://dpcpsi.nih.gov/sites/default/files/Data%20formatting-Step%20by%20step.pdf Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguationStep 3: Manual check of unmerged names. The following slides will show you the step by step description of Part 1 Network Analysis Class
Data cleaning-name disambiguation STEP 3 OVERVIEW Step 3 Overview Part 1: List of Sci2 unique names from the Clustered Names Column Part 2: Excel pivot table showing all Last name First initial variations Docs file, Clustered Authors Column Typo? Same author? Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 1. Load the iSearch output Docs file into Sci2 Data Manager Panel Drag and drop the iSearch output “Docs file” (.CSV format is required) into Sci2 Data Manager panel Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 2. Select the “Standard csv format” option Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 3. Now you should be able to see the loaded file in the “Data Manager” panel Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 4. Go to Data Preparation Extract Co-occurrence network Select Extract Co-Occurrence Network from the dropdown box Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 5. In the “Extract network from table” window, select the column containing the names cleaned by iSearch, and the delimiter present within names • Column Name: Clustered Names • Text delimiter: semicolon (“;”) The delimiter within these names is a semicolon (“;”) Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 6. Two new files will now appear in the Data Manager panel below the loaded CSV file • Two new files: • Extracted network on column clustered • Merge table based on clustered names Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 7. Select the Merge table based on Clustered names file, and right click on it Right click with mouse on the Merge table file Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 8. Select Open with From the dropdown box select “View with” Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 9. Select the Microsoft Excel Comma Separated Values File from the dropdown menu From the dropdown box select open with Excel Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part1 • Create a list of unique names using the “Clustered names” column 10. An excel spreadsheet will open with 3 columns. Column A contains the list of unique names of authors cleaned by iSearch Delete columns B and C since they are not needed for Step 3 of name disambiguation (these columns contain Sci2 internal values) Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguationStep 3: Manual check of unmerged names. The following slides will show you the step by step description of Part 2 Network Analysis Class
Data cleaning-name disambiguation Step 3 OVERVIEW Step 3 Part2 • Create a Excel pivot table showing all Last name First initial variations Part 1: List of unique names from the Clustered Names Column Part 2: Excel pivot table showing all Last name First initial variations Output file, Clustered Names Column Typo? Same author? Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Create a Excel pivot table showing all Last name First initial variations 1. Copy and paste column A into column B Select column A Copy & Paste in column B Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Create a Excel pivot table showing all Last name First initial variations • Steps to separate last and first names 2. Select column B and go to Data Text to Column Select Column B Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • Steps to separate last and first names 2a: Under Original data type, choose delimited 2b: Under delimiters, choose Comma (last names and first names are separated by commas) Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • Now we have last name and first and second names in different columns • (rename column B to Last name) First and second name column Last name column Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • To separate first and second names 1- Select column C and go to Data Text to Column Select Column C Label column Last name column Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • To separate first and second names 2a- Under Original data type, choose delimited 2b- Under delimiters, choose Space (first and second names are separated by spaces) Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • Now we have first name and second names in different columns • 3- Delete column C (empty) Delete • 4- Label Column C as First Name, and delete column D (we don’t need the second names/initials) Delete Delete
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names • Part 2: Create a Excel pivot table showing all Last name First initial variations • To separate create a column with first name’s first initial • 5a- Use the following formula in cell D2 • =LEFT(C2,1) • 5b- Then copy this formula on all column D cells • 5c- Finally, select entire column D and copy and paste the content as values • 5d- Label column as First initial Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names • Part 2: Create a Excel pivot table showing all Last name First initial variations Leave an empty space here • To create a column with last name first initial • 6a- Use the following formula in cell E2 • =B2& “, ”&D2 (please note that there is an empty space after the comma) • 6b- Then copy this formula on all column E cells • 6c- Finally, select entire column E and copy and paste the content as values • 6d- Label column E as Last name first initial Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • 7- Insert a pivot table Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • 8- Set the Pivot table Fields as show below Columns: - Count of label Rows: - Last name First initial - label
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations • 9- Sort the Count of label subtotals from largest to smallest In this example there are no “Last Name First Initial” with more than one “label”. But in the following slide we show other examples where the number of labels per “Last Name First Initial” is more than 1.
Data cleaning-name disambiguation: Step 3 Manual check of unmerged names Step 3 Part2 • Part 2: Create a Excel pivot table showing all Last name First initial variations Typo or two different authors? • Examples of names that should have been merged but were not. • If you find a name that should have been merged but wasn't: • Go to the iSearch Docs file and correct the name in the Clustered Names column (using the “Find” and “Replace” feature) Same author? Third author? ? Now we are ready to format our data in Sci2! Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network