Easier than Excel: Social Network Analysis of DocGraph with Gephi


  • Janos G. Hajagos
  • Stony Brook School of Medicine
  • Fred Trotter
  • fredtrotter.com
DocGraph
  • Based on FOIA request to CMS by Fred Trotter
  • Pre-released at Strata RX 2012
  • Medicare providers (more than doctors)
  • CY 2011 dates of service
  • Share 11 or more patients in a 30 day forward window
  • Initial access restricted to MedStartr funders
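The edge rule in this slide can be sketched in a few lines of Python. This is only one plausible reading of "11 or more patients in a 30 day forward window", not the method CMS used: it assumes a hypothetical list of (patient, provider, day) visit records, counts a patient toward the edge A to B when B sees that patient within 30 days after A does, and keeps edges with at least 11 distinct patients.

```python
from collections import defaultdict

WINDOW_DAYS = 30   # "30 day forward window" from the slide
MIN_PATIENTS = 11  # "11 or more patients" from the slide


def shared_patient_edges(visits, window=WINDOW_DAYS, min_patients=MIN_PATIENTS):
    """visits: iterable of (patient_id, provider_id, day_number); a hypothetical layout."""
    by_patient = defaultdict(list)
    for patient, provider, day in visits:
        by_patient[patient].append((day, provider))

    # (from_provider, to_provider) -> set of patients seen by both within the window
    patients_on_edge = defaultdict(set)
    for patient, events in by_patient.items():
        events.sort()
        for i, (day_a, prov_a) in enumerate(events):
            for day_b, prov_b in events[i + 1:]:
                if day_b - day_a > window:
                    break  # later events are even further away
                if prov_a != prov_b:
                    patients_on_edge[(prov_a, prov_b)].add(patient)

    # Keep only edges that meet the patient-count threshold.
    return {edge: len(p) for edge, p in patients_on_edge.items() if len(p) >= min_patients}


# Toy data: 12 patients see provider "A" on day 0 and provider "B" on day 10.
visits = [(f"p{i}", "A", 0) for i in range(12)] + [(f"p{i}", "B", 10) for i in range(12)]
edges = shared_patient_edges(visits)
print(edges)
```

The threshold means small patient overlaps never appear in the graph, which is one reason the released data is an aggregate rather than patient-level.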
DocGraph by the numbers
  • Directed graph
  • Average total degree 52.8
  • 940,492 providers (graph nodes/vertices)
  • 49,685,810 shared edges
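As a quick check on the figures above, dividing the edge count by the node count reproduces the 52.8 value. In a directed graph that ratio is the mean out-degree (and equally the mean in-degree); counting both directions at every node would give twice that. The variable names are illustrative.

```python
nodes = 940_492      # providers, from the slide
edges = 49_685_810   # directed edges, from the slide

# Each directed edge adds one out-degree and one in-degree somewhere in the graph.
edges_per_node = edges / nodes
in_plus_out_per_node = 2 * edges / nodes

print(round(edges_per_node, 1))
print(round(in_plus_out_per_node, 1))
```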
Geographic visualization


  • National Plan and Provider Enumeration System
  • Source of NPI (National Provider Identifier)
  • No cost download 
  • Information is entered and updated by provider
    • Data quality ranges from good to poor
  • CSV file with 314 columns 
  • A custom MySQL load script is used to normalize the database
  • BloomAPI, an open source project to make the data easier to access
    • http://www.bloomapi.com/
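With 314 columns, it usually helps to pull out only the few fields needed for labeling nodes. The sketch below uses Python's csv module on a tiny in-memory stand-in for the real file; the column names are assumptions and should be checked against the header of the actual download, and the rows are made up.

```python
import csv
import io

# Assumed column names; verify them against the real file's header row.
WANTED = ["NPI", "Provider Business Practice Location Address City Name"]

# Tiny in-memory stand-in for the 314-column file (made-up rows).
sample = io.StringIO(
    "NPI,Entity Type Code,Provider Business Practice Location Address City Name\n"
    "1000000001,1,JAMESTOWN\n"
    "1000000002,2,ALBANY\n"
)


def read_columns(handle, wanted):
    """Yield one dict per row containing only the wanted columns."""
    for row in csv.DictReader(handle):
        yield {name: row[name] for name in wanted}


rows = list(read_columns(sample, WANTED))
print(rows)
```

Reading row by row keeps memory use flat, which matters more for the full file than for this toy sample.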
Graph data

Relation between authors and MeSH terms from PubMed



Graph types
  • Undirected graph
    • Facebook friendships
  • Directed graph
    • Twitter: follow and be followed
  • Bipartite graph
  • Multipartite
    • RDF graph model
    • Property graph model
  • Allow parallel edges
    • RDF graph Model


Graphs in healthcare
  • Prescriber and patient (bipartite)
    • NCPDP data with NPI
  • Referral data sets
  • Shared patients
    • DocGraph
  • Social networks
    • Tweeting about a disease
  • Limited by imagination


Generating GraphML
  • XML-based file format for graphs
  • Readable by a large number of tools
    • Gephi
    • Mathematica
    • igraph (R)
  • NetworkX, a Python library for graphs, can export to GraphML
  • GraphML is not well suited to very large graphs
  • GraphML is not readable by d3.js
Gephi
  • Java-based open source tool
  • Focused on interactivity
    • Fast graphics
    • Multi-threaded
    • Visual updates
  • Strong graph analytics
  • Graphs stored in memory
    • Upper limit is about 100,000 nodes
  • NetBeans plugin architecture
    • Integration with Neo4j
    • Additional layout algorithms
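Since GraphML is plain XML, a small file for loading into Gephi can be produced with Python's standard library alone. The sketch below is a minimal illustration, not the DocGraph export script: the attribute ids "label" and "weight" and the sample providers are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def build_graphml(nodes, edges):
    """nodes: {node_id: label}; edges: [(source, target, weight)]."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the attributes used below: a node label and an edge weight.
    ET.SubElement(root, "key", {"id": "label", "for": "node",
                                "attr.name": "label", "attr.type": "string"})
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", id="G", edgedefault="directed")
    for node_id, label in nodes.items():
        node = ET.SubElement(graph, "node", id=node_id)
        ET.SubElement(node, "data", key="label").text = label
    for source, target, weight in edges:
        edge = ET.SubElement(graph, "edge", source=source, target=target)
        ET.SubElement(edge, "data", key="weight").text = str(weight)
    return ET.tostring(root, encoding="unicode")


xml_text = build_graphml({"n1": "Provider A", "n2": "Provider B"}, [("n1", "n2", 15)])
print(xml_text)
```

Writing `xml_text` to a file with a .graphml extension gives something that tools reading GraphML should accept; for anything beyond a toy graph, a graph library's exporter is the more practical route.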
Downloading Gephi



Downloading sample files




Subsets are generated using a Python script

python extract_providers_to_graphml.py "npi='1750499653'" sterrence Leaf-edges

Opening connection referral
Selection criteria for subset graph: npi='1750499653'
Referral table name: referral.referral2011
NPI detail table name: referral.npi_summary_primary_taxonomy
Nodes will be labeled by: provider_name
Leaf-to-leaf edges will be exported? False
Imported 1 nodes
Imported 986 nodes
Imported 1724 edges
Edge types imported
{'core-to-leaf': 866, 'leaf-to-core': 856, None: 2}
Leaf-to-leaf edges were not selected for export
Writing GraphML file

Generating a subset: some concepts

Core nodes

Connecting core nodes

Adding leaf nodes

Connecting to leaf nodes

Connecting leaf nodes
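The core and leaf concepts can be made concrete with a short sketch. It is not the extraction script itself: it assumes a plain list of directed (source, target) edges and a chosen set of core nodes, treats any non-core neighbor of a core node as a leaf, and tallies edges by type. The labels "core-to-core" and "leaf-to-leaf" are naming assumptions that follow the pattern seen in the script's output.

```python
from collections import Counter


def classify_edges(all_edges, core):
    """Find leaf nodes around `core` and count edges by type within core plus leaves."""
    # Leaf nodes: any non-core node that shares an edge with a core node.
    leaves = set()
    for src, dst in all_edges:
        if src in core and dst not in core:
            leaves.add(dst)
        elif dst in core and src not in core:
            leaves.add(src)

    def kind(node):
        if node in core:
            return "core"
        return "leaf" if node in leaves else None

    counts = Counter()
    for src, dst in all_edges:
        a, b = kind(src), kind(dst)
        if a and b:  # skip edges that fall outside the core-plus-leaf subset
            counts[f"{a}-to-{b}"] += 1
    return leaves, counts


edges = [("c1", "c2"), ("c1", "l1"), ("l1", "c1"), ("l1", "l2"), ("l2", "c2"), ("x", "y")]
leaves, counts = classify_edges(edges, core={"c1", "c2"})
print(sorted(leaves), dict(counts))
```

Dropping the "leaf-to-leaf" entries before export corresponds to the case where leaf nodes are not connected to each other in the subset.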

Sample files
  • jamestown_core_provider_graph.graphml
    • Providers selected with practice addresses in Jamestown, NY
    • Small city in far western New York (approximately 30,000 residents)
    • 179 nodes with 5,560 edges
  • jamestown_core_and_leaf_provider_graph.graphml
    • Includes providers above and those who are linked to them
    • 1,322 nodes with 12,457 edges
  • albany_core_provider_graph.graphml
    • Providers selected with practice addresses in Albany, NY
    • A small city in New York (approximately 100,000 residents)
    • 1,368 nodes with 44,711 edges
Sample files (continued)
  • bronx_core_provider_graph.graphml
    • Providers selected with practice addresses in Bronx, NY
    • Urban community (1.4 million residents)
    • 3,268 nodes and 53,828 edges
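The node and edge counts listed for the sample files are enough to compare how dense the graphs are. The sketch below assumes the sample graphs are directed, like the full DocGraph, so density is edges divided by n(n-1); on that assumption the Jamestown core graph comes out far denser than the Albany and Bronx graphs.

```python
# (nodes, edges) as listed for each sample file
samples = {
    "jamestown_core": (179, 5_560),
    "jamestown_core_and_leaf": (1_322, 12_457),
    "albany_core": (1_368, 44_711),
    "bronx_core": (3_268, 53_828),
}


def directed_density(n, m):
    """Fraction of the n*(n-1) possible directed edges that are present."""
    return m / (n * (n - 1))


density = {name: directed_density(n, m) for name, (n, m) in samples.items()}
mean_out_degree = {name: m / n for name, (n, m) in samples.items()}

for name in samples:
    print(f"{name}: density={density[name]:.4f}, mean out-degree={mean_out_degree[name]:.1f}")
```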
Navigating the graph
  • Best experience with a three button mouse with a scroll wheel
    • Right click and hold to pan
    • Scroll wheel to zoom in and out
    • Left click to select
    • Right click for context menus
  • MacBook users
    • Hold the Command key, then click and hold on the trackpad to pan
    • Two fingers to zoom on trackpad
    • Click on trackpad to select
    • Control click for context menus


Varying node size based on importance
  • Step 1: Select a measure of node importance
    • Degree
    • PageRank
    • Eigenvector centrality
  • Step 2: Run the measure against the graph
  • Step 3: Open the Ranking tab and choose “Size/Weight”
  • Step 4: Set size range
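The idea behind steps 3 and 4 is a mapping from a score to a size range. A minimal sketch, assuming a simple linear (min-max) mapping, is shown below; Gephi's own interpolation may differ, and the function name, size range, and scores are illustrative.

```python
def scale_sizes(scores, min_size=10.0, max_size=50.0):
    """Map each node's score linearly onto [min_size, max_size]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: give every node the midpoint size.
        mid = (min_size + max_size) / 2
        return {node: mid for node in scores}
    span = max_size - min_size
    return {node: min_size + (s - lo) / (hi - lo) * span for node, s in scores.items()}


degree = {"a": 2, "b": 5, "c": 12}  # made-up degree values
sizes = scale_sizes(degree)
print(sizes)
```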


Graph measures
  • Degree
    • In-degree
    • Out-degree
  • Graph structure measures
    • Clustering (global and local)
    • Network diameter
  • Centrality Measures
    • Eigenvector centrality
    • PageRank (Google search)
  • Community measures
  • And more . . . . .
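Two of the measures above are simple enough to sketch without a graph library: in-degree and out-degree are plain counts, and PageRank can be approximated by repeated redistribution of rank along edges. The code below is a basic power-iteration sketch on a made-up four-node graph; the damping factor of 0.85 and the fixed iteration count are conventional choices, not values taken from the slides.

```python
from collections import defaultdict


def degrees(edges):
    """Return (in_degree, out_degree) dicts for a list of directed edges."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return dict(indeg), dict(outdeg)


def pagerank(edges, damping=0.85, iterations=50):
    """Basic power-iteration PageRank over a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping) / count + damping * dangling / count for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


edges = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
indeg, outdeg = degrees(edges)
ranks = pagerank(edges)
print(indeg, outdeg)
print(max(ranks, key=ranks.get))
```

Node "c" has the highest in-degree here and also the highest rank, but the two measures need not agree in general: PageRank weights an incoming edge by the rank of its source.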


Interactively viewing node attributes

Click the “T” icon on the bottom to turn on node labeling


Saving your graph
  • Save your graph in .gephi format
    • XML-based format
    • Preserves layout, size, and color
  • Save in GraphML format for use with outside programs


Hints for filtering nodes
  • Drag field filter “is_physician” from the top pane to the lower pane
  • Set the value to filter on
    • Value should equal 1
    • 1 is equivalent to true
  • Click “Filter” to apply
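Outside Gephi, the same filter amounts to keeping nodes whose attribute equals a value and then keeping only the edges between them. A minimal sketch with made-up nodes follows; the attribute name is_physician comes from the slide, everything else is illustrative.

```python
def filter_nodes(nodes, edges, attribute, value):
    """Keep nodes where nodes[n][attribute] == value, plus edges among those nodes."""
    keep = {n for n, attrs in nodes.items() if attrs.get(attribute) == value}
    kept_edges = [(s, t) for s, t in edges if s in keep and t in keep]
    return keep, kept_edges


nodes = {
    "n1": {"is_physician": 1},
    "n2": {"is_physician": 0},
    "n3": {"is_physician": 1},
}
edges = [("n1", "n2"), ("n1", "n3"), ("n3", "n1")]

physicians, physician_edges = filter_nodes(nodes, edges, "is_physician", 1)
print(sorted(physicians), physician_edges)
```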


Producing a final graph

We need to rescale the edge weights in the graph


Challenge questions
  • Which institution is the most “important” provider for the Bronx?
    • Hint: try a centrality measure
  • Can you determine if geography plays a role in patient sharing in the Bronx?
    • Which parameter could be used to partition the graph?
  • Can you filter the graph to show only radiologists?
  • Which radiologist has the highest “authority” in the graph?


Other tools for graph analysis
  • NetworkX
    • Python
    • Lots of algorithms
  • igraph
    • R and Python
  • Gremlin – graph traversal and manipulation
    • Groovy shell
    • Gremlin interface is implemented for Neo4j
  • And more . . .


Scaling the analysis to the entire DocGraph
  • Most healthcare graphs will be big (millions of nodes)
  • What we learn at the local level can be applied at the global level
    • Importance of geography
    • Supernodes (radiologist, ER docs, pathologist, transportation, …)
  • Many graph measures don’t scale well
    • Maximal cliques
  • Currently exploring how to use Faunus to scale the analysis with Hadoop



http://strata.oreilly.com/2012/11/docgraph-open-social-doctor-data.html (information)

https://github.com/jhajagos/DocGraph (code)

http://notonlydev.com/docgraph-data/ (open source; $1 covers bandwidth fees)

https://groups.google.com/forum/#!forum/docgraph (mailing list)


Try to publish your own healthcare dataset as a graph!