
Reading Microsoft Word XML files with SAS®

Larry Hoyle,

Policy Research Institute,

University of Kansas


Three Scenarios

  • Extracting text and attributes

  • Extracting data from tables

  • Extracting drawing object parameters



XML - Syntax

  • Must begin with the prolog tag

  • Paired tags, must have 1 root tag, case sensitive

  • Tags and content are called an "element"

  • Tags can be qualified by attributes

  • Elements can be nested, and must start and end in the same parent

    <?xml version="1.0" ?>
    <LarryRootTag>
      <EmptyTag/>
      <nestedTag>
        Some content
      </nestedTag>
      <nestedTag anAttribute="wha">
        Other content
      </nestedTag>
    </LarryRootTag>




Word XML

Body

Section

Paragraph

Run

Text

Properties



What Does SAS Need?

  • SAS XML Engine

  • Needs XMLMAP file

  • Can use XML Mapper to generate XMLMAP

  • Only needs to be generated once for each type of extract


Example Document: Styles and Colors Have Meaning

I have never been so humiliated in my life. That was very rude treatment.

What a pleasant experience. Your staff was both quick and pleasant.

It took about the time I expected to reach someone.

I have nothing to say. The sky is blue and the sea is green.

You are the worst organization in the world.

I love you guys.


Style and Color

  • Style is “Treated” – a statement about treatment

  • Color is “Red” - represents negative affect



Example Document as XML

I have never been so humiliated in my life. That was very rude treatment.

What a pleasant experience. Your staff was both quick and pleasant.

It took about the time I expected to reach someone.

I have nothing to say. The sky is blue and the sea is green.

You are the worst organization in the world.

I love you guys.

Paragraph property:

    /w:wordDocument/w:body/wx:sect/w:p/w:pPr

Run property:

    /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr


Rows

  • The XMLMap has to describe a path that delineates rows:

  • In this case it’s each text element in a run (in a paragraph…)

    <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
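To show where the TABLE-PATH sits in a complete file, here is a minimal sketch of an XMLMap skeleton. The map version and the table name WordText are assumptions made for illustration and are not taken from the slides.

```xml
<?xml version="1.0" ?>
<!-- Sketch only: the version attribute and the table name are assumptions -->
<SXLEMAP version="1.2">
  <TABLE name="WordText">
    <!-- One output row for each text element in a run -->
    <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
    <!-- COLUMN definitions for this table go here -->
  </TABLE>
</SXLEMAP>
```

Each TABLE element in the map becomes one dataset, so a single map file could hold several extracts from the same document.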


Columns – the Text

  • The XMLMap has to describe a path that delineates each column:

  • The text itself is:

    <COLUMN name="t">

    <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>


Columns – the Text Element Number

  • A sequential number for the text element is:

    <COLUMN name="tNum" ordinal="YES" retain="YES">

    <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>


Columns – the Paragraph Number

  • A sequential number for the paragraph is:

    <COLUMN name="pNum" ordinal="YES" retain="YES">

    <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
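The slide shows only the opening of the counter column. A completed version might look like the sketch below; the TYPE and DATATYPE values are assumptions about how a counter would normally be typed, not something shown on the slide.

```xml
<!-- Sketch: the counter goes up by one each time a w:p element begins -->
<COLUMN name="pNum" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
  <!-- Assumed typing for a counter column -->
  <TYPE>numeric</TYPE>
  <DATATYPE>integer</DATATYPE>
</COLUMN>
```

Because rows are delineated by text elements rather than paragraphs, every row from the same paragraph would carry the same pNum value, which lets the text pieces be regrouped by paragraph later.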


Columns – Paragraph Color

<COLUMN name="PColorVal" retain="YES">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>


Columns – Run Color

<COLUMN name="RColorVal" retain="YES">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>


Columns – Run Style

<COLUMN name="RStyleval">

<PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:rStyle/@val</PATH>

<TYPE>character</TYPE>

<DATATYPE>string</DATATYPE>

<LENGTH>11</LENGTH>

</COLUMN>




Our Sample Tables

  • Read all data from all tables into one dataset

  • Add variables to indicate table, row, column



The Tables Dataset

Ended first table

Started third table


Word XML – Tables

  • Absolute Path

    /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t

  • Relative Path

    w:tc/w:p/w:r/w:t


Count Table Beginnings

  • <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> w:tbl</INCREMENT-PATH>


Count Table Endings

  • <INCREMENT-PATH beginend="END" syntax="XPath"> w:tbl</INCREMENT-PATH>
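Put together, the two counters might be declared as a pair of columns, as in the sketch below. The column names tblBegin and tblEnd, the ordinal and retain attributes (mirrored from the earlier counter columns), and the TYPE and DATATYPE values are all assumptions; only the two INCREMENT-PATH elements come from the slides.

```xml
<!-- Sketch: counts how many tables have started so far (name is hypothetical) -->
<COLUMN name="tblBegin" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="BEGIN" syntax="XPath">w:tbl</INCREMENT-PATH>
  <TYPE>numeric</TYPE>
  <DATATYPE>integer</DATATYPE>
</COLUMN>

<!-- Sketch: counts how many tables have ended so far (name is hypothetical) -->
<COLUMN name="tblEnd" ordinal="YES" retain="YES">
  <INCREMENT-PATH beginend="END" syntax="XPath">w:tbl</INCREMENT-PATH>
  <TYPE>numeric</TYPE>
  <DATATYPE>integer</DATATYPE>
</COLUMN>
```

Comparing the two values on a row presumably tells which table the row belongs to: while tblBegin is one greater than tblEnd, the current text sits inside table number tblBegin.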



Drawing Object Parameters: VML (Vector Markup Language)

  • This example will only read lines

    • (they’re easiest)

  • Other drawing objects have different XML elements




One Row for Each Line Element

<TABLE-PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line

</TABLE-PATH>


Columns – Parameters as Attributes

<COLUMN name="from">

<PATH syntax="XPath">

/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from

</PATH>
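A completed pair of columns for a line's endpoints might look like the sketch below. The slide shows only the start of the "from" column; the "to" column, the character typing, and the length of 20 are assumptions (VML line elements carry a to attribute alongside from, but the slide does not show it being read).

```xml
<!-- Start point of the line, read from the from attribute -->
<COLUMN name="from">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from</PATH>
  <TYPE>character</TYPE>
  <DATATYPE>string</DATATYPE>
  <!-- Assumed length, sized for a coordinate pair -->
  <LENGTH>20</LENGTH>
</COLUMN>

<!-- End point of the line (hypothetical column, assumed to mirror "from") -->
<COLUMN name="to">
  <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to</PATH>
  <TYPE>character</TYPE>
  <DATATYPE>string</DATATYPE>
  <LENGTH>20</LENGTH>
</COLUMN>
```

Reading the coordinates as character strings leaves the splitting into x and y values to later SAS code.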



The Dataset

Trick:

"Flip" indicates coordinates are swapped


Example Code in Paper

  • Convert colors

  • Parse stroke weight (e.g. 2pt)

  • Detect the keyword “flip” and flip coordinates



Contact Information

Larry Hoyle

Policy Research Institute,

University of Kansas

LarryHoyle@ku.edu

http://www.ku.edu/pri/ksdata/sashttp/sugi31