Mark s rasmussen improve dk
This presentation is the property of its rightful owner.
Sponsored Links
1 / 98

Knowing the Internals, Who Needs SQL Server Anyway? PowerPoint PPT Presentation


  • 92 Views
  • Uploaded on
  • Presentation posted in: General

Mark S. Rasmussen – improve.dk. Knowing the Internals, Who Needs SQL Server Anyway?. Whoami. Tech Lead @ iPaper Developer/DBA/Sysadmin/Project manager/* Comp.Sci @ Aarhus University Blogging at improve.dk @improvedk Author of the OrcaMDF project. Disclaimer.

Download Presentation

Knowing the Internals, Who Needs SQL Server Anyway?

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Mark s rasmussen improve dk

Mark S. Rasmussen – improve.dk

Knowing the Internals, Who Needs SQL Server Anyway?


Whoami

Whoami

  • Tech Lead @ iPaper

  • Developer/DBA/Sysadmin/Project manager/*

  • Comp.Sci @ Aarhus University

  • Blogging at improve.dk

  • @improvedk

  • Author of the OrcaMDF project


Disclaimer

Disclaimer

  • Level 650 – meant to inspire, not teach!

  • Based on 2008 R2

  • I have no idea...

  • Most of what I say is incorrect


Background

Background

  • Presentation at Miracle Open World

  • Formally started OrcaMDF


Old school querying

Old School Querying

  • privatestaticvoid oldschool()

  • {

    • using (varconn=newSqlConnection("Data Source=.;Initial Catalog=QFD;…")

    • {

      • conn.Open();

      • varcmd=newSqlCommand("SELECT * FROM Persons", conn);

      • var reader = cmd.ExecuteReader();

      • while(reader.Read())

        • Console.WriteLine(reader["ID"] +": "+ reader["Name"] +" ("+ reader["A

    • }

  • }


Orcamdf querying

OrcaMDF Querying

  • using (varmdf=newMdfFile(mdfPath))

  • {

    • var scanner =new DataScanner(mdf);

    • var rows = scanner.ScanTable("Persons");

    • EntityPrinter.Print(rows);

  • }

  • using (varmdf=newMdfFile(mdfPath))

  • {

    • var scanner =new DataScanner(mdf);

    • var rows = scanner.ScanTable("Persons")

    • .Where(x => x.Field<short>(”Age”) < 40);

    • EntityPrinter.Print(rows);

  • }


Pages

Pages

The foundation of SQL Server storage


What is a page

What Is a Page?

  • 8192 bytes

  • Everything storedas pages


Undocumented dbcc commands flags

Undocumented DBCC Commands & Flags

  • DBCC IND

  • DBCC PAGE

  • DBCC TRACEON (3604)

  • Documented in 6.5 & 7.0

  • Unofficially documented


Page header

Page Header

  • Absolutely no documentation

  • Absolutely necessary for parsing


Reverse engineering the header

Reverse Engineering the Header

  • DEMO

  • OrcaMDF: PageHeader


Slot array

Slot Array

  • Points to beginningof records in body

  • Defines logicalorder of records


Records

Records

  • Data records

    • Stores table data

  • Index records

    • Stores nonclustered index data, as well as non-leaf level clustered index data

  • Stored in the FixedVar format


Fixedvar record format

FixedVar Record Format


Status bits a

Status Bits A


Status bits b

Status Bits B


Null bitmap

NULL Bitmap

  • Bitmap tracking whether columns are NULL

  • CEIL(#Cols / 8) bytes

  • Always present on data pages, except when it’s not

  • Only trust defined bits – rest may be garbage


Variable length offset array

Variable Length Offset Array


Example record

Example Record

CREATE TABLE RecordTest

(

A int,

B int,

C char(5),

D varchar(10),

E varchar(20)

)

INSERT INTO RecordTest VALUES (25, 38, 'ABCD', 'Mark', 'Denmark')


When is data present

When Is Data Present?

  • Fixed length data always present

    • Even if null

      • Though not necessarily tail columns!

  • Variable length data only present when not null

  • Adding nullable columns is a metadata op

    • Denali default value columns is metadata too!


Data types

Data Types

How are data types stored within a record?


Classifying data types

Classifying Data Types

  • Fixed length data types

    • bit, char, int, decimal, date, datetime, float, etc.

  • Variable length data types

    • (n)varchar, varbinary, varchar(MAX), text, etc.

  • sql_variant

    • Please just stay away from it


Variable length data types

Variable Length Data Types

  • SLOBs

    • varchar(x), nvarchar(x), varbinary(x)

  • LOBs

    • text, ntext, image, varchar(MAX), nvarchar(MAX), varbinary(MAX), xml

  • vardecimal


In row n varchar x storage

In-row (n)varchar(x) Storage

  • CREATE TABLE VarcharTest

  • (

  • A varchar(4)

  • )

  • INSERT INTO VarcharTest VALUES ('Mark')


  • Complex columns

    Complex Columns

    • DEMO

    • Identified using the sign bit

    • 0b1001001110010101 = 37.781

    • 0b0001001110010101 = 5.013

    • Use cases

      • Row-overflow/LOB pointers

      • Sparse vectors

      • Back pointers


    Off row slob storage

    Off-row SLOB Storage

    • Varchar, nvarchar, varbinary

    • DEMO


    Off row slob storage1

    Off-row SLOB Storage

    • Column data moved to new page, pointer left behind


    Off row slob storage2

    Off-row SLOB Storage


    Off row slob storage3

    Off-row SLOB Storage

    B = [BLOB Inline Root] Slot 0 Column 2 Offset 0x11a1 Length (physical) 24

    Level = 0Unused = 0UpdateSeq = 1

    TimeStamp = 1298595840Link 0Size = 4500 RowId = (1:21:0)

    • timestamp =BitConverter.ToInt64(data, 8) <<16;


    Blob fragment record

    BLOB_FRAGMENT Record

    Blob row at: Page (1:21) Slot 0 Length: 4514 Type: 3 (DATA)

    Blob Id:469368832

    • Stored on shared (obj-level) TextMix pages


    Off row slob storage4

    Off-row SLOB Storage

    • Allways stored in-row if < 24 bytes

    • 24 byte [BLOB Inline Root] pointer

    • Data stored in BLOB_FRAGMENT on TextMix page

    • Timestamp == Blob ID

    • Performance prediction is tough


    Max lob storage

    (MAX) LOB Storage

    • varchar(MAX), nvarchar(MAX), varbinary(MAX)

    • The LOB that wanted to be a SLOB

    • Three scenarios

      • [BLOB Inline Data]

      • [BLOB Inline Root]

      • [Textpointer]

    • DEMO


    Blob inline data

    ”BLOB Inline Data”

    • Used when data fits in record

    • Not an official LOB structure

    Slot 0 Column 1 Offset 0x0 Length 0 Length (physical) 0

    A = [NULL]

    B = [BLOB Inline Data] Slot 0 Column 2 Offset 0x1393 Length 4 Length (physical) 4

    B = 0x41424344


    Blob inline root

    BLOB Inline Root

    • Can reference up to 5 pages – data, roots, trees, etc.

    • 12 byte header

    • Array of 12 byte references

    • Only used by SLOBs & (MAX) LOBs

    • Also not a LOB structure (by my definition)


    Lob structure records

    LOB Structure Records

    • Wrapped in a single-column ”meta” record


    Knowing the internals who needs sql server anyway

    DATA

    Blob row at: Page (1:176) Slot 0 Length: 8054 Type: 3 (DATA)

    Blob Id:1210253312

    • Type 3

    • Where data is actually stored

    • Size always > 64 bytes (SMALL_ROOT)


    Knowing the internals who needs sql server anyway

    DATA

    • How much data can we store in a DATA record?

      • 8096 – Page body size

      • 8080 (8094)– Theoretical max

      • 8040 (8054)- Reality


    Internal

    INTERNAL

    Blob row at: Page (1:55) Slot 0 Length: 324 Type: 2 (INTERNAL)

    Blob Id: 1210253312 Level: 0 MaxLinks: 501 CurLinks: 19

    Child 0 at Page (1:176) Slot 0 Size: 8040 Offset: 8040

    Child 1 at Page (1:177) Slot 0 Size: 8040 Offset: 16080

    • Type 2

    • CurLinks = number of references

    • MaxLinks = ?

    • Level = tree level

    • Size = computed


    Internal1

    INTERNAL

    Total record size = 20 + X * 16


    Connecting the dots

    Connecting the Dots


    The tree grows

    The Tree Grows

    • In theory (INTERNAL)...

    • In reality... 500


    Connecting the dots1

    Connecting the Dots


    Connecting the dots2

    Connecting the Dots


    Two levels is all it takes

    Two Levels Is All It Takes

    • 8,040,000,000 bytes / 7,48 GB

    • (MAX) limit is 231-1

    • Many permutations


    Large value types out of row

    Large Value Types Out of Row

    • sp_tableoption ‘MyTable’, ‘Option’, ‘ON/OFF’

    • Even more permutations

    • “text in row” 24-7000, default 256


    Textpointer

    Textpointer

    • Used for classic LOB types & MAX LOB types with ’large value types out of row’ ON

      • text, ntext, image

    • Complex column


    Classic lob structures

    Classic Lob Structures

    • You thought (MAX) was complex?

    • Textpointer =


    Small root

    SMALL_ROOT

    • Type = 0

    • Used when data <= 64 bytes

    • Min size = 84

    • Data > length = garbage


    Large root yukon

    LARGE_ROOT_YUKON

    • Type = 5

    • Min size = 84

    • Part of LOB tree


    Connecting the dots3

    Connecting the Dots


    Connecting the dots4

    Connecting the Dots


    Lob storage overview

    LOB Storage Overview

    • Extreme impact on small data

    • The more data, the less of a diff

    • Performance differences

      • http://sqlblog.com/blogs/paul_white/archive/2011/02/23/Advanced-TSQL-Tuning-Why-Internals-Knowledge-Matters.aspx


    Mind the gap

    Mind the Gap


    Bending the will of dbcc page again

    Bending the Will of DBCC Page. Again.


    Archeology 101

    Archeology 101


    Lob summary

    LOB Summary

    • < 8000 => (MAX) = (X)

    • > 8000 => Tree is built

    • Text/ntext/image horribly inefficient

      • Lots of legacy details


    Indices heaps

    Indices & Heaps

    How pages are organized


    Clustered index vs heap

    Clustered Index vs Heap

    • Defines how data is *physically* stored

    • Clustered index

      • Guarantees physical order of data

      • Row identified by ”clustered key”

    • Heap

      • Data stored whereever SQL Server wants to

      • Row identified by ”RID”


    B tree scanning

    B+-tree Scanning


    Heaps

    Heaps

    • Relies on IAM pages

    • Leaf pages not linked

      • Except...


    Page extent allocation

    Page & Extent Allocation

    Extents, pages & objects


    Extents

    Extents

    • All pages allocated as part of an extent

    • Mixed extents

    • Uniform extents

    • First 8 = mixed, rest uniform


    Gam pages

    GAM Pages

    • Global Allocation Map

    • 1 = Free, 0 = Allocated

    • Bitmap tracks 63,904 extents, almost 4GB

    • Present every 511,232 pages

    • GAM interval

    • 2 / 511232, every 511232 pages


    Sgam pages

    SGAM Pages

    • Shared Global Allocation Map

    • 1 = Mixed & > 0 free pages

    • 0 = Either uniform or mixed w/no free pages

    • Structure identical to GAM

    • 3 / 511233, every 511232 pages


    Iam pages

    IAM Pages

    • Index Allocation Map

    • 1 = Uniformly allocated to IAM chain / allocation unit

    • 0 = Not owned by IAM chain / AU

    • No fixed positioning!

    • Tracks a GAM interval

    • Structure (almost) identical to GAM


    Iam page header storage

    IAM Page Header Storage


    Extent allocation status

    Extent Allocation Status


    Pfs pages

    PFS Pages

    • Page Free Space

    • Bytemap

    • 1 / 8088, every 8088 pages

    • PFS interval

    • Only tracks fullness where necessary


    Mdf file at a glance

    MDF File at a Glance


    Allocation units

    Allocation Units


    The boot page

    The Boot Page

    • Page 9 in primary data file

    • DBCC PAGE == DBCC DBINFO

    • Lots of interesting info

      • Physical version

      • Log rebuild count

      • Last OK CHECKDB

      • Last LOG backup

      • Name + ID

      • FirstSysIndexes


    System views base tables

    System Views & Base Tables

    The source of our parsing metadata


    Needed metadata for parsing

    Needed Metadata for Parsing

    • Schema

      • sys.tables + sys.columns / sys.indexes + sys.index_columns

    • Indexes

      • Root page

    • Heaps

      • IAM chain root

    • DEMO


    Allocation metadata overview

    Allocation Metadata Overview


    Dmvs are useless

    DMVs Are Useless!

    • Just views, no physical storage

    • Chicken or the egg

    • How about we take a look at those views?

    • DEMO


    Base tables

    Base Tables

    • The basis for DMV data

    • Can only be queried through the DAC

      • Here be dragons!

    • Confusing column names

    • Utilizes internal functions


    The holy grail of metadata

    The Holy Grail of Metadata

    • SELECT * FROM sys.sysschobjs

    • sysschobjs

    • syscolpars

    • sysrowsets

    • sysallocunits

    • DEMO


    Follow the rabbit

    Follow the Rabbit

    • Boot page points to sysallocunits (FirstSysIndexes)

    • Constant partition ID leads us to sysrowsets

    • Constant object ID leads us to sysschobjs

    • Using the above we can find syscolpars

    • DEMO


    Openrowset table rscprop

    OPENROWSET(TABLE RSCPROP)

    • SELECT

    • CASEc.maxinrowlen

    • WHEN0 THENp.length 

    • ELSEc.maxinrowlen

    • ENDASmax_inrow_length,

    • p.xtype ASsystem_type_id, 

    • p.length ASmax_length, 

    • p.prec ASPRECISION, 

    • p.scale ASscale,

    • FROM

    • sys.sysrscols c –-sys.system_internals_partition_columns

    • OUTERAPPLY

    • OPENROWSET(TABLERSCPROP, c.ti) p


    Openrowset table rscprop1

    OPENROWSET(TABLE RSCPROP)


    Openrowset table rscprop2

    OPENROWSET(TABLE RSCPROP)

    • CREATETABLETITest

    • (

    • A binary(50),

    • B char(10),

    • C datetime2(5),

    • D decimal(12, 5),

    • E float,

    • F int,

    • G numeric(11, 4),

    • H nvarchar(50),

    • I nvarchar(max),

    • J time(3),

    • K tinyint,

    • L varbinary(max),

    • M varchar(75),

    • N text

    • )


    Openrowset table rscprop3

    OPENROWSET(TABLE RSCPROP)

    • SELECT

    • t.name,

    • r.ti,

    • p.scale,

    • p.precision,

    • p.max_length,

    • p.system_type_id,

    • p.max_inrow_length

    • FROM

    • sys.system_internals_partition_columns p

    • INNERJOIN

    • sys.sysrscols r ON

    • r.rscolid = p.partition_column_id AND

    • r.rsid = p.partition_id

    • INNERJOIN

    • sys.types t ON

    • t.system_type_id = p.system_type_id AND

    • t.user_type_id = p.system_type_id

    • WHERE

    • partition_id = 72057594040614912


    Openrowset table rscprop4

    OPENROWSET(TABLE RSCPROP)


    Openrowset table rscprop5

    OPENROWSET(TABLE RSCPROP)

    • 12973 = 0x000032AD

    • 173 = 0xAD == 12973 & 0xFF

    • 50= 0x32 == (12973 & 0xFFFF00) >> 8


    Openrowset table rscprop6

    OPENROWSET(TABLE RSCPROP)

    • 1322= 0x0000052A

    • 42= 0x2A == 1322 & 0xFF

    • 5= 0x5 == (1322 & 0xFF00) >> 8

    • 25= 20 + scale


    Openrowset table rscprop7

    OPENROWSET(TABLE RSCPROP)

    • 330858= 0x00050C6A

    • 106= 0x6A == 330858 && 0xFF

    • 12= 0x0C == (330858 && 0xFF00) >> 8

    • 5= 0x05 == (330858 && 0xFF0000) >> 16


    Creating a type aware ti parser

    Creating a Type Aware TI Parser

    [Test]

    publicvoidDecimal()

    {

    var parser = newSysrscolTIParser(330858);

    Assert.AreEqual(5, parser.Scale);

    Assert.AreEqual(12, parser.Precision);

    Assert.AreEqual(9, parser.MaxLength);

    Assert.AreEqual(106, parser.TypeID);

    Assert.AreEqual(9, parser.MaxInrowLength);

    parser = newSysrscolTIParser(396138);

    Assert.AreEqual(6, parser.Scale);

    Assert.AreEqual(11, parser.Precision);

    Assert.AreEqual(9, parser.MaxLength);

    Assert.AreEqual(106, parser.TypeID);

    Assert.AreEqual(9, parser.MaxInrowLength);

    }


    Data recovery

    Data Recovery

    When everything elsefails


    Please don t rely on this

    Please Don’t Rely on This

    • You should always have backups available

    • Make sure to test your backups

    • Run regular consistency checks

    • This is a last resort measure


    What kind of problems to expect

    What Kind of Problems to Expect

    • Torn pages

    • Corrupt pages

    • Bad metadata

    • Accidental deletes & truncations

    • How does OrcaMDF differ?


    Torn pages

    Torn Pages

    • One page = 16 disk sectors

    • First and last sector most important

    • No header

      • Identify object from IAM, linked list

    • No slot array

      • Slot count in header

      • Identify record formats in body


    Corrupt pages

    Corrupt Pages

    • Checksum doesn’t match content

    • Could be minor issue, probably major

    • Treat like torn page


    Bad metadata

    Bad Metadata

    • SQL Server bugs

    • Corrupt/torn pages

    • Scan pages and identify object in header

    • Scan pages and look for IAM chain

    • Deduce schema

      • App

      • Docs

      • Record format


    Accidental deletes truncations

    Accidental Deletes & Truncations

    • Accidental delete

      • Records may be ghosted

      • Records removed from slot array

      • STOP!

    • Accidental truncation

      • Pages deallocated, physically intact

      • Scan pages, linked list


    Watch out for instant initialization

    Watch Out For Instant Initialization

    • Garbage may be mistaken for data

    • Was page allocated?

      • Look for clues in salvagable allocation structures


    Orcamdf future plans

    OrcaMDF Future Plans

    • CorruptMdf class

    • Verifying tornbits / checksum

    • Utility methods

      • Scan for pages belonging to object

      • Scan for IAM pages

    • Best-effort parsing of pages


    Questions

    Questions

    Now’s the chance


    Thank you

    Thank you!

    Blog:improve.dk

    Twitter:@improvedk

    Email:[email protected]


  • Login