This document outlines techniques for extracting various elements from Microsoft Word XML files using SAS. It covers three scenarios: extracting text with properties, retrieving all data from tables, and identifying the coordinates of objects in drawings. The methodology hinges on the creation of an XMLMAP file, which helps define the paths for accessing different data elements. The guide also includes syntax examples, covering tags, elements, rows, and columns, enabling users to manipulate and analyze Word documents efficiently within SAS.
Reading Microsoft Word XML files with SAS August 25, 2005 Larry Hoyle -- Policy Research Institute University of Kansas revised 8/18/2005
3 scenarios • Extracting text along with associated properties (styles and attributes) • Extracting all data from tables • Extracting coordinates of objects in drawings
XML - syntax • Must begin with the prolog tag • Tags are paired and case sensitive, and there must be exactly one root tag • Empty tags end with /> • Tags plus their content are called an "element" • Tags can be qualified by attributes • Elements can be nested, and must start and end in the same parent
<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>
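The syntax rules on this slide can be checked with any XML parser. The sketch below uses Python's standard library instead of SAS, purely for illustration; the helper names `describe` and `well_formed` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide.
XML_TEXT = """<?xml version="1.0" ?>
<LarryRootTag>
  <EmptyTag/>
  <nestedTag> Some content </nestedTag>
  <nestedTag anAttribute="wha"> Other content </nestedTag>
</LarryRootTag>"""


def describe(xml_text):
    """Return (tag, attributes, content) for each child of the root tag."""
    root = ET.fromstring(xml_text)
    return [
        (child.tag, dict(child.attrib), (child.text or "").strip())
        for child in root
    ]


def well_formed(xml_text):
    """True if the text parses, i.e. tags are paired and properly nested."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


rows = describe(XML_TEXT)
for row in rows:
    print(row)

# Tags are case sensitive, so <a> is not closed by </A>.
print(well_formed("<a></A>"))
```

The empty tag comes back with no attributes and no content, and only the second `nestedTag` carries the `anAttribute` attribute.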
Word XML • Nesting: Body > Section > Paragraph > Run > Text • Paragraphs and runs can each carry Properties
Extracting text and properties • SAS XML Engine • Needs XMLMAP file • Can use XML Mapper to generate XMLMAP • Only needs to be generated once for each type of extract
Example Document I have never been so humiliated in my life. That was very rude treatment. What a pleasant experience. Your staff was both quick and pleasant. It took about the time I expected to reach someone. I have nothing to say. The sky is blue and the sea is green. You are the worst organization in the world. I love you guys.
XML - Example Document (the same nine sentences as in the Example Document, stored as Word XML) • Paragraph property: /w:wordDocument/w:body/wx:sect/w:p/w:pPr • Run property: /w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr
Rows • The XMLMap has to describe a path that delineates rows: • In this case it’s each text element in a run (in a paragraph…) <TABLE-PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</TABLE-PATH>
Columns – the text • The XMLMap has to describe a path that delineates each column: • The text itself is: <COLUMN name="t"> <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</PATH>
Columns – the text element number • A sequential number for the text element is: <COLUMN name="tNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:t</INCREMENT-PATH>
Columns – the paragraph number • A sequential number for the paragraph is: <COLUMN name="pNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p</INCREMENT-PATH>
Columns – paragraph color <COLUMN name="PColorVal" retain="YES"> <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:pPr/w:rPr/w:color/@val</PATH>
Columns – run color <COLUMN name="RColorVal" retain="YES"> <PATH syntax="XPath">/w:wordDocument/w:body/wx:sect/w:p/w:r/w:rPr/w:color/@val</PATH>
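To make the row and column paths concrete, here is a rough Python sketch (not SAS, and not the XML engine itself) that walks the same paths and builds one row per text element. The sample XML is invented for illustration, the namespace URIs are those of Word 2003 XML, and carrying the last color forward is an assumption about how retain="YES" columns behave.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}
W_VAL = "{%s}val" % NS["w"]  # Word writes the attribute as w:val

# Invented two-paragraph sample; not the deck's actual document.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:p>
      <w:pPr><w:rPr><w:color w:val="FF0000"/></w:rPr></w:pPr>
      <w:r><w:rPr><w:color w:val="FF0000"/></w:rPr>
        <w:t>That was very rude treatment.</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>I have nothing </w:t></w:r>
      <w:r><w:rPr><w:color w:val="0000FF"/></w:rPr><w:t>to say.</w:t></w:r>
    </w:p>
  </wx:sect></w:body>
</w:wordDocument>"""


def color_of(parent, path):
    """Return the color value found at path under parent, or None."""
    node = parent.find(path, NS)
    if node is None:
        return None
    return node.get(W_VAL) or node.get("val")


def text_rows(xml_text):
    """One row per w:t, with counters and retained color columns."""
    root = ET.fromstring(xml_text)
    rows = []
    t_num = p_num = 0
    p_color = r_color = None
    for p in root.findall("w:body/wx:sect/w:p", NS):
        p_num += 1  # pNum: increments at each paragraph
        # Assumption: a retained column keeps its last value when absent.
        p_color = color_of(p, "w:pPr/w:rPr/w:color") or p_color
        for r in p.findall("w:r", NS):
            r_color = color_of(r, "w:rPr/w:color") or r_color
            for t in r.findall("w:t", NS):
                t_num += 1  # tNum: increments at each text element
                rows.append({
                    "tNum": t_num, "pNum": p_num, "t": t.text,
                    "PColorVal": p_color, "RColorVal": r_color,
                })
    return rows


rows = text_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the colors are carried forward, the uncolored second paragraph still shows the first paragraph's color in `PColorVal`; a real analysis may need to account for that.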
Tables - DataSet Rows <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t </TABLE-PATH>
Tables – Table Number <COLUMN name="tblNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl </INCREMENT-PATH>
Tables – Row Number <COLUMN name="trNum" ordinal="YES" retain="YES"> <INCREMENT-PATH beginend="BEGIN" syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl/w:tr </INCREMENT-PATH>
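The table and row counters can be pictured with a short Python sketch over an invented two-table sample. It is only an illustration of the paths above, not the SAS XML engine; in particular, letting the row counter run on across tables rather than restarting is an assumption.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Invented sample: a 2-row table followed by a 1-row table.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl>
      <w:tr>
        <w:tc><w:p><w:r><w:t>a</w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t>b</w:t></w:r></w:p></w:tc>
      </w:tr>
      <w:tr><w:tc><w:p><w:r><w:t>c</w:t></w:r></w:p></w:tc></w:tr>
    </w:tbl>
    <w:tbl>
      <w:tr><w:tc><w:p><w:r><w:t>d</w:t></w:r></w:p></w:tc></w:tr>
    </w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""


def table_rows(xml_text):
    """Return (tblNum, trNum, text) for every text element in a table."""
    root = ET.fromstring(xml_text)
    rows = []
    tr_num = 0  # assumption: counter is never reset between tables
    tables = root.findall("w:body/wx:sect/w:tbl", NS)
    for tbl_num, tbl in enumerate(tables, start=1):
        for tr in tbl.findall("w:tr", NS):
            tr_num += 1
            for t in tr.findall("w:tc/w:p/w:r/w:t", NS):
                rows.append((tbl_num, tr_num, t.text))
    return rows


rows = table_rows(SAMPLE)
for row in rows:
    print(row)
```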
Nested Tables – Absolute Path for Rows <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:tbl/w:tr/w:tc/w:p/w:r/w:t </TABLE-PATH>
Nested Tables – Rootless Path for Rows <TABLE-PATH syntax="XPath"> w:tbl/w:tr/w:tc/w:p/w:r/w:t </TABLE-PATH>
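The difference between the two paths can be seen in a small Python sketch on an invented document with one table nested inside a cell of another. Treating the rootless path as "match at any depth" (written `.//` in ElementTree) is an assumption made for this illustration, not a statement about the SAS XML engine.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespaces for the w: and wx: prefixes.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
}

# Invented sample: the outer cell holds a paragraph and a nested table.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body><wx:sect>
    <w:tbl><w:tr><w:tc>
      <w:p><w:r><w:t>outer</w:t></w:r></w:p>
      <w:tbl><w:tr><w:tc>
        <w:p><w:r><w:t>inner</w:t></w:r></w:p>
      </w:tc></w:tr></w:tbl>
    </w:tc></w:tr></w:tbl>
  </wx:sect></w:body>
</w:wordDocument>"""

CELL_TEXT = "w:tbl/w:tr/w:tc/w:p/w:r/w:t"

root = ET.fromstring(SAMPLE)

# Absolute path: anchored at the document root, so only cells of
# top-level tables match.
absolute = [t.text for t in root.findall("w:body/wx:sect/" + CELL_TEXT, NS)]

# Rootless path, emulated as "any depth": cells of nested tables match too.
rootless = [t.text for t in root.findall(".//" + CELL_TEXT, NS)]

print(absolute)
print(sorted(rootless))
```

The absolute path misses the text in the nested table, which is the motivation for the rootless form.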
Drawing Objects: VML – Vector Markup Language • Drawings in Word are also stored as XML • We’ll just look at lines
Dataset – One Row for Each Line <TABLE-PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line </TABLE-PATH>
Dataset – Column: From <COLUMN name="from"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@from </PATH>
Dataset – Column: To <COLUMN name="to"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@to </PATH>
Dataset – Column: StrokeColor <COLUMN name="strokecolor"> <PATH syntax="XPath"> /w:wordDocument/w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line/@strokecolor </PATH>
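As an illustration of the line paths, the Python sketch below pulls the same attributes from an invented drawing with two lines. The coordinate values and colors are made up, and the `style` attribute is included only because the later annotate code reads a `style` variable.

```python
import xml.etree.ElementTree as ET

# Word 2003 XML namespaces plus the VML namespace for the v: prefix.
NS = {
    "w": "http://schemas.microsoft.com/office/word/2003/wordml",
    "wx": "http://schemas.microsoft.com/office/word/2003/auxHint",
    "v": "urn:schemas-microsoft-com:vml",
}

# Invented sample drawing; attribute values are for illustration only.
SAMPLE = """<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
    xmlns:v="urn:schemas-microsoft-com:vml">
  <w:body><wx:sect><w:p><w:r><w:pict>
    <v:group>
      <v:line from="0,0" to="100,50" strokecolor="red"/>
      <v:line style="flip:y" from="10,20" to="30,40" strokecolor="blue"/>
    </v:group>
  </w:pict></w:r></w:p></wx:sect></w:body>
</w:wordDocument>"""

LINE_PATH = "w:body/wx:sect/w:p/w:r/w:pict/v:group/v:line"


def line_rows(xml_text):
    """One row per v:line, with its from, to, strokecolor and style."""
    root = ET.fromstring(xml_text)
    return [
        {
            "from": line.get("from"),
            "to": line.get("to"),
            "strokecolor": line.get("strokecolor"),
            "style": line.get("style", ""),
        }
        for line in root.findall(LINE_PATH, NS)
    ]


rows = line_rows(SAMPLE)
for row in rows:
    print(row)
```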
The Dataset • Trick: "flip" in the style indicates that coordinates are swapped
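Usage Example: Annotate dataset
/* xyPattern is assumed to hold a pattern id from an earlier PRXPARSE call */
if prxmatch(xyPattern, from) then do;
  function = 'move';
  x = input(prxposn(xyPattern, 1, from), 10.);
  /* with flip:y in the style, the starting y comes from "to" */
  if prxmatch('/flip:y/', style) then
    y = -1 * input(prxposn(xyPattern, 2, to), 10.);
  else
    y = -1 * input(prxposn(xyPattern, 2, from), 10.);
  output;
end;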
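The same flip logic can be sketched in Python for readers who want to test it outside SAS. The slide does not show the regular expression behind xyPattern, so `XY_PATTERN` below is a hypothetical stand-in that accepts two comma-separated numbers with optional unit suffixes; `move_point` is likewise a made-up name. Unlike the SAS code, this sketch matches `to` separately instead of reusing the match on `from`.

```python
import re

# Hypothetical stand-in for xyPattern: "x,y", each optionally followed
# by a unit suffix such as pt or in.
XY_PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)[a-z]*\s*,\s*(-?\d+(?:\.\d+)?)")


def move_point(from_, to, style):
    """Return the (x, y) of the 'move' step for one line, or None.

    x always comes from 'from'. y comes from 'to' when the style
    contains flip:y, otherwise from 'from'. y is negated, as in the
    SAS code.
    """
    m_from = XY_PATTERN.search(from_)
    if m_from is None:
        return None
    x = float(m_from.group(1))
    if re.search(r"flip:y", style or ""):
        m_to = XY_PATTERN.search(to)
        if m_to is None:
            return None
        y = -float(m_to.group(2))
    else:
        y = -float(m_from.group(2))
    return (x, y)


print(move_point("10,20", "30,40", ""))
print(move_point("10,20", "30,40", "flip:y"))
```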
Contact Information Larry Hoyle Policy Research Institute, University of Kansas LarryHoyle@ku.edu http://www.ku.edu/pri/ksdata/sashttp/sugi31