Reading Microsoft Word XML files with SAS®

Larry Hoyle,

Policy Research Institute,

University of Kansas

Three Scenarios
  • Extracting text and attributes
  • Extracting data from tables
  • Extracting drawing object parameters

XML - Syntax
  • Must begin with this prolog tag
  • Paired tags, must have 1 root tag, case sensitive
  • Tags and content together are called an "element"
  • Tags can be qualified by attributes
  • Elements can be nested, and must start and end in the same parent

<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag>
    Some content
  </nestedTag>
  <nestedTag anAttribute="wha">
    Other content
  </nestedTag>
</LarryRootTag>
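
The syntax rules above can be checked mechanically. The following is a minimal sketch in Python (not part of the original presentation, which targets SAS) that parses the sample document with the standard library and lists the root tag, child elements, and attributes. The function names `describe` and `is_well_formed` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, with content kept on one line per element.
SAMPLE = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag>Some content</nestedTag>
  <nestedTag anAttribute="wha">Other content</nestedTag>
</LarryRootTag>"""


def describe(xml_text):
    """Return the root tag and a (tag, attributes, text) tuple for each child element."""
    root = ET.fromstring(xml_text)
    children = [
        (child.tag, dict(child.attrib), (child.text or "").strip())
        for child in root
    ]
    return root.tag, children


def is_well_formed(xml_text):
    """True if the text parses as XML; mismatched or mis-cased tags make it fail."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root_tag, children = describe(SAMPLE)
print(root_tag)
for child in children:
    print(child)
```

For instance, `is_well_formed("<a></A>")` is false because tag names are case sensitive, and `is_well_formed("<a><b></a></b>")` is false because the nested element does not end inside its parent.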


Word XML
  • Body (w:body)
  • Section (wx:sect)
  • Paragraph (w:p)
  • Run (w:r)
  • Text (w:t)
  • Properties (w:pPr for a paragraph, w:rPr for a run)

What Does SAS Need?
  • SAS XML Engine
  • Needs XMLMAP file
  • Can use XML Mapper to generate XMLMAP
  • Only needs to be generated once for each type of extract
Example Document: Styles and Colors Have Meaning

I have never been so humiliated in my life. That was very rude treatment.

What a pleasant experience. Your staff was both quick and pleasant.

It took about the time I expected to reach someone.

I have nothing to say. The sky is blue and the sea is green.

You are the worst organization in the world.

I love you guys.

Style and Color
  • Style is “Treated” – a statement about treatment
  • Color is “Red” – represents negative affect

Example Document as XML

Paragraph property:

/w:wordDocument/w:body/wx:sect/w:p/w:pPr

Run property:

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr

Rows
  • The XMLMap has to describe a path that delineates rows:
  • In this case it’s each text element in a run (in a paragraph…)

<TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>

Columns – the Text
  • The XMLMap has to describe a path that delineates each column:
  • The text itself is:

<COLUMN name="t">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>

Columns – the Text Element Number
  • A sequential number for the text element is:

<COLUMN name="tNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>

Columns – the Paragraph Number
  • A sequential number for the paragraph is:

<COLUMN name="pNum" ordinal="YES" retain="YES">

<INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>

Columns – Paragraph Color

<COLUMN name="PColorVal" retain="YES">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>

Columns – Run Color

<COLUMN name="RColorVal" retain="YES">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>

Columns – Run Style

<COLUMN name="RStyleval">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val</PATH>

<TYPE>character</TYPE>

<DATATYPE>string</DATATYPE>

<LENGTH>11</LENGTH>

</COLUMN>
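
To make the effect of these column definitions concrete, here is a minimal Python sketch (not from the presentation, and not how the SAS XML engine is implemented) that walks a tiny hand-written WordprocessingML 2003 fragment and builds one row per `w:t` element with the same column names. The sample document is invented for illustration, and `retain="YES"` is approximated as "carry the last non-missing value forward", which is an assumption to verify against the SAS documentation.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespaces for the w: and wx: prefixes used in the paths.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"
NS = {"w": W, "wx": WX}
VAL = f"{{{W}}}val"  # the namespaced w:val attribute

# Invented sample: two paragraphs, three runs.
SAMPLE = f"""<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>I have never been so humiliated in my life. </w:t></w:r>
      <w:r><w:rPr><w:rStyle w:val="Treated"/><w:color w:val="FF0000"/></w:rPr><w:t>That was very rude treatment.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>I love you guys.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def attr_val(parent, path):
    """Return the w:val attribute of the element at `path`, or None if absent."""
    node = parent.find(path, NS)
    return node.get(VAL) if node is not None else None


def text_rows(xml_text):
    """Build one row per w:t element, mimicking the columns defined in the map."""
    root = ET.fromstring(xml_text)
    rows = []
    t_num = 0
    p_color = r_color = None  # carried forward (assumed retain="YES" behavior)
    for p_num, para in enumerate(root.findall("w:body/wx:sect/w:p", NS), start=1):
        p_color = attr_val(para, "w:pPr/w:rPr/w:color") or p_color
        for run in para.findall("w:r", NS):
            r_color = attr_val(run, "w:rPr/w:color") or r_color
            r_style = attr_val(run, "w:rPr/w:rStyle")  # no retain: reset per run
            for t in run.findall("w:t", NS):
                t_num += 1
                rows.append({
                    "pNum": p_num, "tNum": t_num,
                    "PColorVal": p_color, "RColorVal": r_color,
                    "RStyleval": r_style, "t": t.text,
                })
    return rows


rows = text_rows(SAMPLE)
for row in rows:
    print(row)
```

Note that in this sketch the last row keeps the red color from the previous paragraph because the retained columns are never reset, which is worth keeping in mind when interpreting retained values.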

Our Sample Tables
  • Read all data from all tables into one dataset
  • Add variables to indicate table, row, column
The Tables Dataset

Ended first table

Started third table

Word XML – Tables
  • Absolute Path

/w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

  • Relative Path

w:tc/w:p/w:r/w:t

Count Table Beginnings
  • <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> w:tbl</INCREMENT-PATH>
Count Table Endings
  • <INCREMENT-PATH beginend="END" syntax="XPath"> w:tbl</INCREMENT-PATH>
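
Counting on element begin and end corresponds closely to the start and end events of a streaming parser. The sketch below is a Python illustration of that idea, not the SAS mechanism itself: it increments a counter when a `w:tbl` starts and another when it ends, and tags each table cell's text with table, row, and column numbers. The sample document and the helper names (`q`, `cell`, `table_rows`) are invented, and nested tables are assumed not to occur.

```python
import io
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WX = "http://schemas.microsoft.com/office/word/2003/auxHint"


def q(tag):
    """Qualified name of a tag in the w: namespace, as ElementTree reports it."""
    return f"{{{W}}}{tag}"


def cell(text):
    """Invented helper: one table cell holding a single run of text."""
    return f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"


# Invented sample: a 2x2 table, a plain paragraph, then a 1x1 table.
SAMPLE = (
    f'<w:wordDocument xmlns:w="{W}" xmlns:wx="{WX}"><w:body><wx:sect>'
    f'<w:tbl><w:tr>{cell("a")}{cell("b")}</w:tr><w:tr>{cell("c")}{cell("d")}</w:tr></w:tbl>'
    f'<w:p><w:r><w:t>between tables</w:t></w:r></w:p>'
    f'<w:tbl><w:tr>{cell("e")}</w:tr></w:tbl>'
    '</wx:sect></w:body></w:wordDocument>'
)


def table_rows(xml_text):
    """Return (table, row, col, text) for every w:t that sits inside a table."""
    rows = []
    started = ended = row = col = 0
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == q("tbl"):
            if event == "start":      # like beginend="BEGIN"
                started += 1
                row = 0
            else:                     # like beginend="END"
                ended += 1
        elif event == "start" and elem.tag == q("tr"):
            row += 1
            col = 0
        elif event == "start" and elem.tag == q("tc"):
            col += 1
        elif event == "end" and elem.tag == q("t") and started > ended:
            # started > ended means a table is currently open (no nesting assumed)
            rows.append((started, row, col, elem.text))
    return rows


result = table_rows(SAMPLE)
for r in result:
    print(r)
```

Comparing the two counters is what distinguishes text inside a table from text between tables: when the begin count equals the end count, no table is open.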
Drawing Object Parameters: VML – Vector Markup Language
  • This example will only read lines
    • (they’re easiest)
  • Other drawing objects have different XML elements
One Row for Each Line Element

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

</TABLE-PATH>

Columns – Parameters as Attributes

<COLUMN name="from">

<PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from

</PATH>


The Dataset

Trick: "Flip" indicates coordinates are swapped

Example Code in Paper
  • Convert colors
  • Parse stroke weight (e.g. 2pt)
  • Detect the keyword “flip” and flip coordinates
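
The paper's SAS code is not reproduced in this transcript. As a rough illustration of two of the steps listed above (parsing a stroke weight and handling "flip"), here is a Python sketch under stated assumptions: lengths look like `2pt`, the `from` and `to` attributes look like `x,y` pairs, and a `flip:x` or `flip:y` entry in the line's `style` attribute means the corresponding coordinates of the two endpoints should be exchanged. All function names and sample values are hypothetical, and color conversion is left out.

```python
import re

# A number followed by an optional unit suffix, e.g. "2pt" or "0.75pt".
_LENGTH = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def parse_length(text):
    """Split a length such as '2pt' into (2.0, 'pt')."""
    m = _LENGTH.match(text)
    if m is None:
        raise ValueError(f"not a length: {text!r}")
    return float(m.group(1)), m.group(2)


def parse_point(text):
    """Turn 'x,y' (each with an optional unit) into a pair of floats; units are dropped."""
    x, y = (parse_length(part)[0] for part in text.split(","))
    return x, y


def flip_axes(style):
    """Return the set of axes named by a 'flip:' entry in a style string (assumed format)."""
    m = re.search(r"flip\s*:\s*([xy ]+)", style or "")
    return set(m.group(1).split()) if m else set()


def line_endpoints(from_attr, to_attr, style=""):
    """Return the two endpoints, exchanging coordinates on any flipped axis (assumption)."""
    (x1, y1), (x2, y2) = parse_point(from_attr), parse_point(to_attr)
    axes = flip_axes(style)
    if "x" in axes:
        x1, x2 = x2, x1
    if "y" in axes:
        y1, y2 = y2, y1
    return (x1, y1), (x2, y2)


print(parse_length("2pt"))
print(line_endpoints("126pt,27pt", "234pt,99pt", "position:absolute;flip:y"))
```

The exact meaning of "flip" in the paper may differ from this reading, so the swap logic should be checked against the paper's code before relying on it.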
Contact Information

Larry Hoyle

Policy Research Institute,

University of Kansas

[email protected]

http://www.ku.edu/pri/ksdata/sashttp/sugi31
