520 likes | 973 Views
DAT290 Databases for the World: Designing Multilingual Databases Using SQL Server 2005. Michael Kaplan Technical Lead Globalization Infrastructure, Fonts & Tools Microsoft Corporation. DAT290 Wereld databanken: Meertalige databases ontwerpen met SQL Server 2005. Michael Kaplan
E N D
DAT290Databases for the World:Designing Multilingual Databases Using SQL Server 2005 Michael Kaplan Technical Lead Globalization Infrastructure, Fonts & Tools Microsoft Corporation
DAT290Wereld databanken:Meertalige databases ontwerpen met SQL Server 2005 Michael Kaplan Technical Lead Globalization Infrastructure, Fonts & Tools Microsoft Corporation
Agenda • Unicode tekstvelden gebruiken • Omgaan met niet-Unicode data • SQL Server collation-ondersteuning • Collations aanpassen • Te vermijden collation-valkuilen • Andere taalondersteuningen • Meer informatie
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
A (null) Unicode 0041 9662 FF96 0000 41 88 AF D6 00 MBCS Unicode is… 0xFFFF Compatibility Private use Surrogates Surrogates Hangul • 16-bit encoding, originally designed to handle 65,536 code points • Supplementary characters allow for over 1 million code points • Microsoft Windows NT (especially Windows 2000) and later is Unicode-based Ideographs(Hanzi, Kanji, Hanja) Kana Symbols Punctuation Thai Indian Arabic, Hebrew Greek Latin ASCII 0x0000 Not to scale!
Using Unicode Text Fields (cont’d)Basics about SQL Server Unicode • NCHAR, NVARCHAR, NTEXT • META data – SYSNAME • N’Unicode’ • UCS-2 Server storage(with some UTF-16 abilities) • Conversions between ANSI to Unicode use MultiByteToWideChar and WideCharToMultiByte
Using Unicode Text Fields (cont’d)Unicode encoding • UTF-16 (or UCS-2) • The “standard” 16-bit Unicode • UTF-32 (or UCS-4) • UTF-8 • “Unicode Transformation Format, 8-bit” • For transmission over unknown lines
Using Unicode Text Fields (cont’d)UCS-2 vs. UTF-16 • The difference? • Supplementary character (also known as surrogate pair) support
Using Unicode Text Fields (cont’d)Supplementary characters • Surrogate range from U+D800 to U+DFFF • Contains 1024 low and 1024 high surrogate values • Surrogate aware comparison operator for all new (*_90) collations • Ability to manipulate and display supplementary characters
Using Unicode Text Fields (cont’d)Surrogate pairs in new (*_90) collations • Supplementary character aware • =, <, >, >=, =<, <>, !=, !<, !> • With Group By, Order By, Distinct • Based on code point comparison • Equivalent to code point comparison when comparing two supplementary characters • When comparing a supplementary and a BMP character, the supplementary character has a fixed weight • Native T-SQL functions give surrogate code units no weight
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
Handling Non-Unicode DataGB18030 Support • Chinese National Standard • Incorporates all of Unicode, including Extension A and B ideographs • Can be one, two, or four bytes • Microsoft SQL Server has support for: • Converting native GB18030 stream into Unicode • Storing and manipulating native GB18030 (as Unicode characters) • Output can be either native GB18030 or Unicode characters
Handling Non-Unicode Data (cont’d)All other non-Unicode data • Use SQL Server collations • Only ANSI code pages (ACPs) supported • Use the .NET Framework Encoding support or tools like BCP/DTS for other code pages
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
SQL Server Collation SupportCollation basics • Provides an appropriate sort order German: a < ä, Swedish: z < ä French: x < y, Lithuanian: i < y • Collations also define code pages for non-Unicode conversions • 55 language collations – Updated to Microsoft Windows XP/Microsoft Windows Server 2003 data (plus Turkish i!) • New Comparison Flag • _BIN2 (Binary) Fastest – True Unicode Code Point Compare • New Yukon collations are tagged _90
SQL Server Collation Support (cont’d)Technical details • Windows • 55 language collations • Windows Locale + sort order • Suffix meanings (comparison flag (sort order) • _BIN (Binary) Fastest • _BIN2 (Binary code Point) Fastest* • _CI/_CS (Case sensitivity) (A == a) • _AI/_AS (Accent sensitivity) (á == a) • _KS - kanatype sensitivity (hiragana == katakana) • _WS - width sensitivity (full width == half width) • SQL Collations • Provided for backwards compatibility with prior versions of SQL Server * Slightly slower on insert than _BIN, same speed otherwise
SQL Server Collation Support (cont’d)Technical details (cont’d) • What makes up the full set? • All unique collations • All unique ANSI code pages • All new unique collations • All SQL compatibility collations • The full number of SQL Server collations Select * FROM ::fn_helpcollations()
SQL Server Collation Support (cont’d)A four-level approach • Collation can be specified at four levels • Server • Database • Column • Expression
SQL Server Collation Support (cont’d)At the server level • Acts as a default for all databases • Can be changed with RebuildM.exe in the tools\BINN dir (or unattended setup) • Querying the server collation: SELECT CONVERT(char, SERVERPROPERTY('collation'))
SQL Server Collation Support (cont’d)Setup/upgrade • Unicode datatypes in master database • DB-object metadata converted in Unicode • Sort order compatibility 7.08.0 6.5
SQL Server Collation Support (cont’d)At the database level • Every database has a collation (default is the server collation) • Collation can be changed using ALTER DATABASE Create database db_test collate Latin1_General_CI_AI ALTER DATABASE CONVERT(char, DATABASEPROPERTYEX(‘db_test', 'collation'))
SQL Server Collation Support (cont’d)At the column level • Overrides database level collation • CREATE TABLE jobs • ( • job_id smallint • IDENTITY(1,1) • PRIMARY KEY CLUSTERED, • job_desc nvarchar(50) • COLLATE Arabic_CI_AI_KS • NOT NULL • DEFAULT 'New Position - title not formalized yet', • )
SQL Server Collation Support (cont’d)At the expression level Regular syntax SELECT * FROM tblCustomers ORDER BY LastName, FirstName Collation-aware syntax SELECT * FROM tblCustomers ORDER BY LastName COLLATE Lithuanian_AI_CI, FirstName COLLATE Lithuanian_AI_CI
SQL Server Collation Support (cont’d)Collation example • Inherits collation from Name column (French_CS_AS) • Forced collation/ordering SELECT *FROM PRODUCTSORDER BY Name, Description SELECT *FROM PRODUCTSORDER BY Name COLLATE French_CI_AI, Description
SQL Server Collation Support (cont’d)The rules of precedence
SQL Server Collation Support (cont’d)Precedence example CREATE TABLE TestTab (id int, GreekCol nvarchar(10) collate greek_ci_as, LatinCol nvarchar(10) collate latin1_general_cs_as) INSERT TestTab VALUES (1, N'A', N'a') • This query will return the following error: Msg 446, Level 16, State 9, Server cpsqlintl3, Line 1Cannot resolve collation conflict for equal to operation • This query will work properly and return one row: SELECT * FROM TestTab WHERE GreekCol = LatinCol SELECT * FROM TestTab WHERE GreekCol = LatinCol COLLATE greek_ci_as
SQL Server Collation Support (cont’d)Full Text Search • Allows for word or phrase-based indexing of character data • Full-text indexing enables the creation and population of the full-text catalogs • Maintained outside of SQL Server • Managed by the Microsoft Search service • Full-text search uses new Transact-SQL predicates to query these populated full-text catalogs • CONTAINS • CONTAINSTABLE • FREETEXT • FREETEXTTABLE • With a full-text query, you can perform: • A linguistic search of character data in tables enabled for full-text search • A linguistic search that operates on words and phrases unlike the LIKE predicate which is used to search character patterns
SQL Server Collation Support (cont’d)Full Text Search enhancements • Accent sensitivity • New word-breakers are accent sensitive • Usage of LCID • Specify the LCID on individual queries • Choose the correct language resources in a query • Useful against both XML and BLOB data • Number of supported MS Search word breakers is increasing all the time
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
Changing CollationsUnicode fields • No data loss – just an order change.... • Gotchas when moving between old and new collations
Changing Collations (cont’d)Non-Unicode fields • Nearly guaranteed data loss in most cases • Many caveats – only 100% safe prior to data being added • Many safeguards to prevent data loss
Changing Collations (cont’d) Updates to tools and technologies • BCP • DTS (Data Transformation Services) • SQLCMD • XML Support
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
Collation Pitfalls to AvoidAvoid accent-insensitive collations • U+ac00 (가, Hangul Syllable Kiyeok A) under Korean_Wansung_CI_AI or _CS_AI is equal to all of the following characters: • 伽佳假價加可呵哥嘉嫁家暇架枷柯歌珂痂稼苛茄街袈訶賈跏軻迦駕仮傢咖哿坷宊斝榎檟珈笳耞舸葭謌 • All of these Hanja have the same pronunciation as U+ac00, and differ only by that secondary “diacritic” weight • User perception of any operation for sorting/searching that treats them as equal – BROKEN APPLICATION • Other collations have similar problems • Thai, Indic languages, others • Workaround – search with AI but sort by AS (closest matches will be on top)
Collation Pitfalls to Avoid (cont’d)Be careful with wildcard searches • Take two queries: • Using Thai_CS_AS collation,Query #1 returns true, query #2 returns false! • In Thai, U+0e40 U+0e21 is a unique compression • Has its own sort weight • Neither U+0e40 nor U+0e21 can be found individually when they are together in a text field • Workaround: use a Binary collation • if N'เหม' like N'%ม%' select 'true' else select 'false' • if N'เม' like N'%ม%' select 'true' else select 'false‘
Collation Pitfalls to Avoid (cont’d)Binary collations have no language • All *_BIN and *_BIN2 collations are equal for Unicode data • For non-Unicode data, they control code page to use and nothing else
Collation Pitfalls to Avoid (cont’d)Binary collations, SQL Server vs. the CLR *_BIN == Ordinal *_BIN2 == <nothing> <nothing> == OrdinalIgnoreCase
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
Other Language Support IssuesOverview • Culturally appropriate behavior • Language of error messages • Date/Time • Month/day names and abbreviated month names • First day of week • Currency and currency symbols
Other Language Support Issues (cont’d)Setting the language • Session language syslanguages • Set by ‘set language’ • SET LANGUAGE French • Set by connection attribute • OLE DB – SSPROP_INIT_CURRENTLANGUAGE • ODBC – LANGUAGE keyword • .NET Data Provider – language keyword in ConnectionString • Set by user record in syslogins • sp_configure "language", 2 • sp_helplanguage
Other Language Support Issues (cont’d)Date/Time support • DATETIME • 1 Jan 1753 to 31 Dec 9999 • 1/300 of a second precision (3.33 milliseconds) • Stored as 2x 4-byte integers • SMALLDATETIME • 1 Jan 1900 to 6 Jun 2079 • 1 minute precision (29.998 seconds round down) • Stored as 2x 2-byte integers • No name strings in date/time • Use Convert() Function to format date/time values • SELECT CONVERT(char, GETDATE(), 104) AS [104] • 21.11.2002 • UDT support for Date/Time • Custom calendars!
Other Language Support Issues (cont’d)SQL Server vs. the CLR • Updated collations that match the CLR • Look for the _90 suffix on the name • SQL Server locale semantics: • One setting for UI and formatting • Another setting for collation/encoding • .NET/Windows semantics • One setting for UI • Another setting for formatting/collation • Solution • Special GetCultureInfo override that takes two CultureInfo names for the two SQL Server settings
Other Language Support Issues (cont’d)SQL Server vs. the CLR, continued • Microsoft.ReportingServices.Diagnostics.Localization • CatalogCulture • ClientPrimaryCulture • DefaultReportServerCulture • FallbackUICulture • InstalledCultureNames • ReportParameterCulture • SqlCulture
Other Language Support Issues (cont’d)Documentation enhancements • Books Online • International Considerations for SQL Server • Windows Server System Books • Planning • Upgrading from SQL Server 2000 to 2005 • Migrating your Applications from SQL Server 2000 to 2005 • Planning Guide • Collation Compatibility (collation/language/SQL version) • Collation Support • Collation Performance Considerations • Deployment • Deploying SQL Server 2005 in an International Environment • Mixed Collation Environments • Windows versus SQL Collations • Locale versus Country versus Region
Agenda • Using Unicode text fields • Handling non-Unicode data • SQL Server collation support • Changing Collations • Collation pitfalls to avoid • Other language support issues • Resources
Resources International Features in Microsoft SQL Server 2000: http://msdn.microsoft.com/library/en-us/dnsql2k/html/intlfeaturesinsqlserver2000.asp Arabic Language Support in MS SQL Server 2000: http://msdn.microsoft.com/library/en-us/dnsql2k/html/sql_arabicsupport.asp Preserving Client-Server Data Integrity with Unicode and Microsoft SQL Server 2000:http://msdn.microsoft.com/library/en-us/dnsql2k/html/sql_dataencoding.asp SQL Server Books On Line Sorting It All Out: http://blogs.msdn.com/michkap Global Development: http://microsoft.com/globaldev
Ask The ExpertsGet Your Questions Answered You can find me at the Microsoft Ask the Experts area, located in the Exhibition Hall: Tuesday 5 July 13:30 – 15:00Thursday 7 July 13:30 – 15:00
Community Resources • Microsoft Community Resourceshttp://www.microsoft.com/communities/default.mspx • Non-Microsoft Community Resourceshttp://www.microsoft.com/communities/related/default.mspx • NewsgroupsConverse online with Microsoft Newsgroups, including Worldwidehttp://www.microsoft.com/communities/newsgroups/default.mspx • User GroupsMeet and learn with your peershttp://www.microsoft.com/communities/usergroups/default.mspx • Attend a free chathttp://www.microsoft.com/communities/chats/default.mspx • Attend a free web casthttp://www.microsoft.com/usa/webcasts/default.asp • Most Valuable Professional (MVP)http://mvp.support.microsoft.com/
Microsoft Learning Resources: Get ready for Microsoft Visual Studio 2005 and Microsoft SQL Server 2005 with free Assessments and E-Learning, and a chance to win a laptop, GPS, and more • Click here to access free Microsoft Learning Assessments http://www.microsoft.com/learning/assessment/ind/default.asp • To access free Microsoft Learning E-Learning visit http://www.microsoft.com/learning/access and reference the promotional code 9185-TECHED-6650. Special offers on Microsoft Certificationfrom Microsoft Learning Click here to take advantage of these special offers: http://www.microsoft.com/learning/mcp/
The Microsoft Guide to Support Options Free of charge support Fee based support Managed support 1 2 3 • Self Support • Knowledge Base • Microsoft Newsgroups • Support Webcasts • Assisted Support • Email Support • Phone Support • Advisory Services • Remotely delivered, hourly fee–based, consultative support option • Contract-based Support • Microsoft Premier Support • Provides premium level support and a designated Account Manager • Microsoft Essential Support • Prepackaged support options including phone and online access to a pool of skilled Technical Account Specialists • Microsoft Partner Advantage • Partner Advantage provides technology partners the best response times and highest level of problem resolution support Resources: Microsoft Support Website: http://support.microsoft.com Microsoft TechNet program: http://technet.microsoft.com Microsoft MSDN program: http://msdn.micrososft.com