Web Apps I18n Testing and Test Data Katsuhiko Momoi, Sr. Test Engineer, Google Inc.
An Outline • Web Apps Internationalization/Localization • Current scenes and practices • I18n Testing and Test Data • Good test data -- Essential for good i18n test coverage • Types of I18n Data & their uses • Delivery mechanisms • Efficient I18n Testing • Tools/APIs • Summary and Conclusion
Web Apps I18n and L10n • Increasing number of web apps • 30 or more at MSN, Yahoo, Google, etc. • Typically info-oriented apps, e.g. search, maps, jobs, local search, etc. • Also some desktop-like apps – chat, mail, documents, spreadsheets, presentations, photos, etc. • Support for a large number of languages • Yahoo, MSN, Google – Home pages in over 40 languages • Broader revenue base? • Core code base in Unicode • Display pages in UTF-8 • Major web sites have switched to Unicode
Web Apps Development and I18n • Frequent updates and releases • Features more important than stability? • Short development cycles • 1-4 week release cycles • Pressures on testing teams • At any given time … • Multiple test servers and multiple versions to be tested • Multiple languages released at the same time • One binary, many localized languages • Last minute translation updates are frequent • New testing strategies needed
General Testing Strategies: Test Code! • Devise a testing plan at the product design stage • Push testing upstream into code writing • Refactor code as needed to make it more testable • Make code units testable • Tools like Testability Explorer are useful • http://code.google.com/p/testability-explorer/ • Test engineers and coding engineers work side by side • Minimize the release cycle • Testing is everyone’s business!
General Testing Strategies: Use humans wisely! • Automated acceptance tests for continuous builds • Bug, latency, and performance issue tracking • Automate as many tasks as possible • Use/create tools to help automate testing • Use open source tools/test frameworks • Minimize UI-level testing wherever possible • Human testers for exploratory testing and analysis • Complex scenario testing
I18n Testing: An overview • Begin I18n feature/code discussion at the product design stage • Draft an I18n testing plan early in development • Feature requirements • also include local market requirements • Libraries • 3rd-party components • Check I18n readiness • Review the localization plan and schedule • Start exploratory testing as soon as code is ready • Include I18n test cases in the automated smoke test • A small number of critical tests for continuous builds
I18n Testing: Before Localization Begins • Is the code Unicode compliant? • Run unit tests • Look for ways to test locale-dependent features before L10n • Test product code against I18n libraries • Time/Date/Currency formats, Collation, etc. • Identify special locale dependent features • Run language specific feature tests against an English or Pseudo-localized build (e.g. CJK IME testing) • Sniff out potential local product requirement issues
I18n Testing: Localizability • Use pseudo-localization to identify • Unextracted/hard-coded strings • String concatenation issues • Potential text expansion problems A sample: Original: Use default text encoding for outgoing messages Pseudo: [Ûšé ðéƒåûļţ ţéxţ éñçöðîñĝ ƒör öûţĝöîñĝ méššåĝéš one two three four five] • Establish a localizability checkpoint before L10n • Work with the translation team to establish guidelines • A set of check items that need to be passed before serious localization can begin • For example, the set may include all of the testing items mentioned so far • Example of a syntax problem: (You bought) [merchandise] (on) [date]
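A pseudo-localizer that produces output like the sample above can be very small. Here is a minimal Java sketch, assuming a short accent map and a rough 40% padding rule (both are illustrative simplifications, not the actual scheme behind the sample):

```java
import java.util.Map;

/** Minimal pseudo-localizer sketch: accent ASCII letters, bracket, and pad. */
public final class PseudoLocalizer {
  // Illustrative accent map; a real tool would cover the full ASCII range.
  private static final Map<Character, Character> ACCENTS = Map.of(
      'a', 'å', 'e', 'é', 'i', 'î', 'o', 'ö', 'u', 'û', 'n', 'ñ', 'g', 'ĝ');

  public static String pseudo(String source) {
    StringBuilder sb = new StringBuilder("[");
    for (char c : source.toCharArray()) {
      sb.append(ACCENTS.getOrDefault(c, c));
    }
    // Pad to simulate ~40% text expansion; truncated padding in the UI
    // reveals layouts that cannot absorb longer translations.
    String padding = " one two three four five";
    sb.append(padding, 0, Math.min(padding.length(), source.length() * 2 / 5 + 1));
    return sb.append(']').toString();
  }
}
```

The bracket delimiters matter: if a rendered UI string is missing its closing "]", the string was truncated; if it shows no accents at all, it was never extracted for translation.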
I18n Testing: When Localization begins • Almost all major I18n issues should have been caught/fixed long before this • Run functional tests against localized builds • Things to look out for: • Functional breakage due to literal dependency • Bidi UI breakage • UI breakage/text expansion issues • Untranslated strings • Translation appropriateness • Usually a task for linguistic reviewers • Get a product evaluation done by users in target countries or by target language users
I18n Test Areas: A summary • Unicode Compliance • Input/Output correctness for character data • Functional correctness with non-ASCII data • Locale Dependent Code Behavior: General • Date/Time/Time Zone, Collation/Sorting, Search/Filter, Language switch, Address, Currency • Locale Dependent Code Behavior: Product Specific • Webmail: outgoing mail encoding selection • Special features offered via geo-location setting, e.g. weather info • Early discovery of Localizability issues (before localization begins) • Use pseudo-localization as a tool • Functional testing under localized UI • Watch out for Bidi/RTL UI issues
Data: Unicode Compliance Test with data based on the latest Unicode standard (5.x) • Unformatted string data: random characters • Build test data with UnicodeSet/ICU • Use patterns or build programmatically • Can also build a UnicodeSet from the script(s) associated with a locale • CLDR exemplar characters (but probably not the auxiliary set) • Unformatted string data: natural language data • May build a tool/API to generate real language strings • Useful to create language data from a small subset of major script types • e.g. CJK, Latin1, RU, HE/AR, TH • Not tied to any specific locale • Hodge-podge data set from major languages • Good for manual testing
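A minimal Java sketch of both generation approaches using ICU4J's UnicodeSet and LocaleData; the helper class and method names are illustrative:

```java
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.LocaleData;
import com.ibm.icu.util.ULocale;
import java.util.Random;

/** Sketch: random test strings built from ICU4J UnicodeSets. */
public final class RandomStrings {
  private static final Random RND = new Random();

  /** Random string drawn from a UnicodeSet pattern, e.g. "[[:Script=Cyrillic:]]". */
  public static String fromPattern(String pattern, int length) {
    return pick(new UnicodeSet(pattern), length);
  }

  /** Random string drawn from a locale's CLDR exemplar characters. */
  public static String fromLocale(ULocale locale, int length) {
    // The auxiliary exemplar set (usually excluded, as noted above) can be
    // obtained separately via LocaleData if a test really needs it.
    return pick(LocaleData.getExemplarSet(locale, 0), length);
  }

  private static String pick(UnicodeSet set, int length) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < length; i++) {
      sb.appendCodePoint(set.charAt(RND.nextInt(set.size())));
    }
    return sb.toString();
  }
}
```

Usage might look like `RandomStrings.fromLocale(new ULocale("th"), 50)` to get 50 random Thai exemplar characters for an input-handling test.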
Data: Quick Sanity Check for Unicode Compliance • If you want a quick sanity check on Unicode compliance … • Hand-crafted short strings may suffice: • Îñţérñåţîöñåļîžåţîöñ (UTF-8 two-byte characters) • Ünïcödé日本語𠀋 (UTF-8 two- to four-byte characters) • Random vs. real language strings • Random strings • Unit/Code testing • Real language strings • Manual testing • Real language data is needed for detecting language • Such strings are easily recognizable, making quick Unicode compliance checks possible
Bad Data: Invalid/malformed/broken data • Test your code with bad data • Learn how your product deals with the unexpected • What happens when the product has to process invalid Unicode data? • Malformed UTF-8, UTF-16 bytes • Incorrect BOM • Text data file uploaded with incorrect information about encoding • EUC-JP file uploaded as Shift_JIS file for Japanese • File uploads with incorrectly encoded data • Data not defined under a protocol • Plain text data with HTML entities
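A sketch of probing a decoder with classic invalid UTF-8 sequences using the JDK's strict CharsetDecoder. The byte sequences are standard textbook examples of malformed UTF-8; whether your product should reject such input or substitute U+FFFD is a product decision, not something this sketch prescribes:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch: probe text handling with invalid UTF-8 byte sequences. */
public final class BadUtf8Probe {
  // Classic invalid sequences: lone continuation byte, truncated 3-byte
  // sequence, overlong encoding, and an encoded unpaired surrogate.
  static final byte[][] BAD_UTF8 = {
      {(byte) 0x80},
      {(byte) 0xE3, (byte) 0x81},
      {(byte) 0xC0, (byte) 0xAF},
      {(byte) 0xED, (byte) 0xA0, (byte) 0x80},
  };

  public static void main(String[] args) {
    CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    for (byte[] bytes : BAD_UTF8) {
      strict.reset();
      try {
        strict.decode(ByteBuffer.wrap(bytes));
        System.out.println("Unexpectedly decoded: " + Arrays.toString(bytes));
      } catch (CharacterCodingException expected) {
        // A well-behaved component should reject (or substitute U+FFFD for)
        // such input rather than crash or silently corrupt data.
        System.out.println("Rejected as expected: " + Arrays.toString(bytes));
      }
    }
  }
}
```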
Data: Local Encoding Data • Most current web apps use UTF-8 • Ex. All major web mail services and search home pages are in UTF-8 • Local encoding data are needed only for: • Data upload and download from your web apps • Address books, spreadsheet data in CSV format • Apps accepting user uploads of text files • Your web pages/apps being embedded in another web site that uses a local encoding • Search/Ads display syndicated to another web site
Lang/Locale Specific Formats: Date/Time • Difficult to test for accuracy • A number of formats are available on all platforms; Slovenian examples: • Full: EEEE, dd. MMMM yyyy (sreda, 30. julij 2008) • Long: dd. MMMM yyyy (30. julij 2008) • Medium: d.M.yyyy (30.7.2008) • Short: d.M.yy (30.7.08) • Shortened forms are typical for web apps • Date/Time display in webmail Inbox view • Gmail: 30. jul • Some locale formats become outdated, and newer formats may not be available from platforms • Updates help but sometimes not fast enough • Differences of opinion among native speakers
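The four standard styles above can be pulled from ICU4J rather than hard-coded, as in this sketch (the exact output depends on the CLDR data bundled with your ICU version):

```java
import com.ibm.icu.text.DateFormat;
import com.ibm.icu.util.ULocale;
import java.util.Date;

public class SlovenianDateStyles {
  public static void main(String[] args) {
    ULocale sl = new ULocale("sl");
    Date date = new Date();
    // FULL, LONG, MEDIUM, SHORT correspond to the four styles listed above.
    // The patterns come from CLDR and can change between ICU releases, which
    // is why formatted output is best validated against the library rather
    // than against hard-coded golden strings.
    for (int style : new int[] {DateFormat.FULL, DateFormat.LONG,
                                DateFormat.MEDIUM, DateFormat.SHORT}) {
      System.out.println(DateFormat.getDateInstance(style, sl).format(date));
    }
  }
}
```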
Lang/Locale Specific Formats: Number/Currencies • Platform libraries provide number and currency formats • Testing issues: • Number formats vary considerably from locale to locale • Placement of currency symbols relative to numbers • The Euro is often recommended to be placed after the number • 3.50€ • But it is often written as €3,50 or even 3€50 (DE or FR locale) • Native currency conventions carried over to the Euro • Native speaker validation is important • But such testers may not be available for all languages
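A sketch of surfacing these placement differences with ICU4J's NumberFormat; the locale list is illustrative:

```java
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.util.Currency;
import com.ibm.icu.util.ULocale;

public class EuroPlacement {
  public static void main(String[] args) {
    Currency eur = Currency.getInstance("EUR");
    for (String tag : new String[] {"de-DE", "fr-FR", "en-IE", "nl-NL"}) {
      NumberFormat nf = NumberFormat.getCurrencyInstance(ULocale.forLanguageTag(tag));
      nf.setCurrency(eur);
      // Symbol position and decimal/grouping separators both vary, e.g.
      // typically "3,50 €" for de-DE but "€ 3,50" for nl-NL.
      System.out.println(tag + ": " + nf.format(3.50));
    }
  }
}
```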
Lang/Locale Specific Formats: Addresses/Phone Numbers • Names, Addresses, Phone Numbers • These formats are usually not supplied by platform libraries • Custom address widgets may be necessary (see the sketch below) • Different countries/regions may need: • a different order of address fields (larger-to-smaller or smaller-to-larger domains) • more street address lines (3, for example) • an 'extra field' in their address • 'static text' between address items • no state (or province) in the address • three levels of administrative divisions in the address (province, second-level city, third-level city or county), e.g. China • different label names for address items • A single Display Name field is easier than separate given name, middle name, family name fields
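One common way to drive such a custom widget is a data-driven, per-country field schema. A minimal Java sketch; the field lists are illustrative, not a complete postal specification:

```java
import java.util.List;
import java.util.Map;

/** Sketch: data-driven address layouts (illustrative field orders only). */
public final class AddressSchemas {
  static final Map<String, List<String>> FIELDS = Map.of(
      // Smaller-to-larger order, with a state field:
      "US", List.of("name", "street1", "street2", "city", "state", "zip"),
      // No state; postal code precedes city:
      "DE", List.of("name", "street1", "postalCode", "city"),
      // Larger-to-smaller order with multi-level administrative divisions:
      "CN", List.of("province", "city", "district", "street1", "name"),
      // An extra third street line is sometimes needed:
      "GB", List.of("name", "street1", "street2", "street3", "postTown", "postcode"));

  static List<String> fieldsFor(String country) {
    return FIELDS.getOrDefault(country, FIELDS.get("US"));
  }
}
```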
Lang/Locale Specific Data: Collation, Search/Filter • Data for collation/sorting and search/filter don’t have specific formats for languages/locales • Best to use authentic language data • Collation samples available from ICU • Optional settings make the validation task complex • Punctuation, lower/upper-case distinctions • Search/Filter: • CJK/Thai segmentation (ICU support available) • Query language detection needed • Real language strings are a must
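A sketch of locale-sensitive sort validation with ICU4J's Collator; the Swedish/German contrast reflects their standard alphabet orders, and the strength setting is one of the optional settings that complicate validation:

```java
import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;
import java.util.Arrays;

public class CollationCheck {
  public static void main(String[] args) {
    String[] words = {"öl", "zebra", "ost"};

    // Swedish treats 'ö' as a distinct letter sorted after 'z';
    // German sorts 'ö' together with 'o' (DIN 5007-1).
    String[] sv = words.clone();
    Arrays.sort(sv, Collator.getInstance(new ULocale("sv")));  // ost, zebra, öl
    String[] de = words.clone();
    Arrays.sort(de, Collator.getInstance(new ULocale("de")));  // öl, ost, zebra

    System.out.println(Arrays.toString(sv) + " vs " + Arrays.toString(de));

    // Optional settings change the expected order, e.g.:
    //   collator.setStrength(Collator.PRIMARY);  // ignore case/accent distinctions
  }
}
```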
Language Specific Data: Mail, Calendar, etc. Some functions are best tested with real language data • Mail data for outgoing message encoding tests • Needs to be language specific since the best encoding selection is language bound • Examples: • Cyrillic characters: • KOI8-R for Russian but KOI8-U for Ukrainian • Han (CJK) characters: • ISO-2022-JP for ja, GB2312 for zh-Hans, Big5 for zh-Hant, EUC-KR for ko • Google Calendar • Event creation via text input: e.g. “6pm Party at John’s house” • Starting time: 6pm (on the day selected) • Place: John’s house • Spellchecker language • Auto-detect the language and offer the spell checker for that language
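A sketch of language-keyed encoding selection using the mappings listed above; the lookup table shape and the UTF-8 fallback are assumptions for illustration, not any particular mail product's logic:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Sketch: pick a legacy outgoing-mail charset by language tag. */
public final class OutgoingEncoding {
  private static final Map<String, String> LEGACY = Map.of(
      "ru", "KOI8-R",
      "uk", "KOI8-U",
      "ja", "ISO-2022-JP",
      "zh-Hans", "GB2312",
      "zh-Hant", "Big5",
      "ko", "EUC-KR");

  static Charset forLanguage(String languageTag) {
    String name = LEGACY.get(languageTag);
    // Anything unmapped falls back to UTF-8 (an assumption in this sketch).
    return name == null ? StandardCharsets.UTF_8 : Charset.forName(name);
  }
}
```

Test data for this path must be real text in each language: the encoder only exercises the KOI8-U branch, say, if the message actually contains Ukrainian.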
What do we need? • Fast Development Cycles • Support for a large number of languages • Many items to cover for I18n testing • Even automated tests take too long if they need to be run against 50+ languages … • We need Efficient I18n Testing!
Efficient I18n Testing: Code Coverage • The key concept is test coverage for the written code • Is there sufficient testing written for every class? • Is the code easily testable? • Evangelize testability to development teams • Ask for a unit test with every check-in • Run Unicode compliance tests • Share testing with development team members • Automate testing wherever possible • Identify important coverage areas for I18n • Provide test APIs • Test locale-dependent behavior at the code level • Doable for data generated by I18n libraries • Doable before localization begins
Efficient I18n Testing: UI driven automation • For larger UI driven tests (e.g. user scenarios) • Use automation frameworks: Selenium, Eggplant • Minimize such test cases – often unreliable for web apps under development • Use mocks to speed up server interactions • Subset core test cases to suit I18n needs • Features that do not handle character data are unlikely to break under different UI languages • Run tests for a set of representative UI languages – not all of them • Choose different script types: Japanese, Chinese, Korean, Cyrillic, Latin1/2, Greek, Arabic, Hebrew, Devanagari, Turkish, Thai • Use data sets from these representative languages: core test data • Minimum effective data and language set for i18n coverage • Provides good representative coverage for various language scripts • Shorten the time to run localized language UI test cases
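A sketch of what such a core set might look like as data; the exact locale choices are illustrative and should be tuned per product:

```java
import java.util.List;

/** Sketch: a representative "core" locale set covering distinct scripts. */
public final class CoreLocales {
  static final List<String> CORE = List.of(
      "ja", "zh-CN", "ko",  // CJK
      "ru",                 // Cyrillic
      "de", "pl",           // Latin-1 / Latin-2
      "el",                 // Greek
      "ar", "he",           // RTL / Bidi
      "hi",                 // Devanagari
      "tr",                 // Turkish casing (dotless i)
      "th");                // Thai (no word spaces; segmentation)

  // A UI suite would iterate over CORE on every build, and run the full
  // language list only on a slower schedule (e.g. nightly or pre-release).
}
```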
Efficient I18n Testing: Use humans wisely! • Possible areas left for manual/exploratory i18n testing • Product/UI correctness for local markets • Complex functional testing scenarios • UI areas too new for automation • Language specific matters: CJK IME, Bidi UI/Layout, etc. • I18n Compatibility testing with other apps • Locale specific apps list • Not I18n but • Appropriateness for local markets • Local product testers • Translation quality • Linguistic reviewers/editors
Tools: what do we need? • Tools and APIs that would help i18n testing … and globalization processes • Some exist internally at companies, and others should be coming • Hopefully, good ones will be offered as open-source software
Tools: Data validation via I18n libraries Unit test your code: (before localization!) • Locale format validation tests • Any formatted data generated by a library should be unit testable with validating test classes • Open source libraries should come with such test suites for all format generating classes • Do they exist for ICU? If not, we should organize such efforts • Date/Time, Currency, Number/Decimal • Locale dependent function validation tests • Sorting validation • Time zones per locale • Calendar correctness: start date of the week, Calendar types • Segmentation • Transforms validation
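One library-driven validation pattern that needs no hard-coded golden strings is a format/parse round trip. A JUnit 4 sketch with ICU4J (the test framing is an assumption; a few exotic locales may need leniency adjustments):

```java
import static org.junit.Assert.assertEquals;

import com.ibm.icu.text.DateFormat;
import com.ibm.icu.util.ULocale;
import java.util.Date;
import org.junit.Test;

public class DateFormatRoundTripTest {
  @Test
  public void mediumDateRoundTripsInEveryLocale() throws Exception {
    Date date = new Date(1217376000000L);  // 30 July 2008, the slide's example date
    for (ULocale locale : ULocale.getAvailableLocales()) {
      DateFormat df = DateFormat.getDateInstance(DateFormat.MEDIUM, locale);
      // Parsing a date-only string loses the time of day, so compare the
      // re-formatted result rather than the raw Date values.
      Date parsed = df.parse(df.format(date));
      assertEquals(locale.toString(), df.format(date), df.format(parsed));
    }
  }
}
```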
Tools: Random Data Generation • ICU • UnicodeSet: • Pattern generation • Build data programmatically: e.g. Character properties • Random string generator with ExemplarSet characters for specified locales • Dangerous character generator • Define what they are • Generate them on demand
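What counts as "dangerous" is a per-project definition, as the slide says. This sketch uses an illustrative list (bidi controls, zero-width characters, a mid-string BOM, and markup/quoting triggers):

```java
import java.util.Random;

/** Sketch: inject "dangerous" characters into otherwise valid strings. */
public final class DangerousChars {
  // Illustrative definition of "dangerous"; tailor this per product.
  static final String[] DANGEROUS = {
      "\u202E", "\u202C",       // RLO / PDF bidi controls
      "\u200B", "\u200D",       // zero-width space / joiner
      "\uFEFF",                 // BOM appearing mid-string
      "<", ">", "&", "'", "\""  // markup and quoting triggers
  };

  private static final Random RND = new Random();

  /** Returns the base string with one dangerous character inserted at random. */
  static String sprinkle(String base) {
    StringBuilder sb = new StringBuilder(base);
    sb.insert(RND.nextInt(sb.length() + 1), DANGEROUS[RND.nextInt(DANGEROUS.length)]);
    return sb.toString();
  }
}
```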
Tools: Real Language Data Generation • Provide: real/validated language data (not random character strings) • For a large number of languages • Updatable database • Provide real language sentences – not just words • Representative Unicode compliance data (hand-crafted) • Varied-length data strings • Provide custom locale-dependent format strings • e.g. possible to customize a date format string from ICU • An API callable from within a test case/data • Where do we collect data? Most search companies have data such as: • Search keywords • Language/encoding detection data • Translation training data • etc.
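A hypothetical shape for such an API, matching the wish list above; every name here is invented for illustration, not an existing library:

```java
import com.ibm.icu.util.ULocale;
import java.util.List;

/**
 * Hypothetical interface for a real-language data service as described
 * above; all names are illustrative.
 */
public interface RealLanguageData {
  /** A validated natural-language sentence in the given language. */
  String sentence(ULocale locale, int minLength, int maxLength);

  /** Hand-crafted Unicode-compliance strings of varied lengths. */
  List<String> complianceStrings(ULocale locale);

  /** A customized locale-dependent format string, e.g. an ICU date pattern. */
  String formatPattern(ULocale locale, String skeleton);
}
```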
Tools: UI String Translation Tool • Most test cases should be language independent • But some test cases may need translated UI strings for validation • (ex) If a product has a language switch setting, one can validate the switching by matching a few UI strings in the target language • A translation database usually exists for a project • A tool that will generate equivalent strings for different locales from the translation database • Write test cases with placeholders for UI strings – load the string values from locale data files
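A minimal sketch using a JDK ResourceBundle as a stand-in for strings exported from the translation database; the bundle and key names are assumptions:

```java
import java.util.Locale;
import java.util.ResourceBundle;

/** Sketch: resolve expected UI strings per locale instead of hard-coding them. */
public final class UiStrings {
  // Assumes messages_de.properties etc., exported from the translation database.
  static String expected(String key, Locale locale) {
    return ResourceBundle.getBundle("messages", locale).getString(key);
  }
}

// In a test case, e.g. after switching the UI language to German:
//   assertEquals(UiStrings.expected("inbox.title", Locale.GERMAN), pageTitle());
```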
Tools: Find language related problems • A tool that auto-detects the language of target pages • Helps with misplaced language pages: e.g. a page expected to serve one language's content serves another by mistake • A tool that crawls links and finds dead ones or ones in the wrong language • A tool that finds untranslated strings at server build time • Exploit your translation infrastructure to come up with such tools
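A sketch of the language-audit idea; LanguageDetector here is a hypothetical stand-in for whatever detection component you plug in, and the crawling itself is omitted:

```java
import java.util.Map;

/** Sketch: flag pages whose detected language differs from the expected one. */
public final class LanguageAudit {
  /** Hypothetical stand-in for a real detector (e.g. an n-gram classifier). */
  interface LanguageDetector {
    String detect(String pageText);  // returns a language tag like "de"
  }

  static void audit(Map<String, String> urlToExpectedLang,
                    Map<String, String> urlToPageText,
                    LanguageDetector detector) {
    urlToExpectedLang.forEach((url, expected) -> {
      String actual = detector.detect(urlToPageText.get(url));
      if (!expected.equals(actual)) {
        System.out.printf("%s: expected %s but detected %s%n", url, expected, actual);
      }
    });
  }
}
```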
Tools/Methods: Pseudo-localization schemes • A pseudo-localization scheme that catches • usual things like unextracted strings, string concatenation, string expansion • Additionally finds: • undesirable syntax and constructions in the original • unhealthy dependencies among adjectives, nouns, relational particles (e.g. on/at/from …), and numbers • A scheme that works well for catching and debugging Bidi/mixed-text issues • Make the pseudo locale a permanent part of your test environment! • Others?
Concluding summary • Large-scale, fast development processes for web apps require efficient testing strategies • Test code directly wherever you can • I18n input/output testing is particularly suited for this approach • Share testing work with development teams • Testing is everyone’s business! • Use tools like Testability Explorer to measure testability • Use libraries for validation tests • Create tools for data generation and validation • Exploit your translation process and infrastructure to create tools that will help your testing and shipping
Concluding summary (continued) • Repetitive tasks should be automated • Consider subsetting your automated test cases to those directly affecting i18n • UI-driven automated tests can take too long if a large number of languages need to be supported • Core language set idea: • Establish a core set of languages for your product • Use data from this core set of languages • For example: CJK, RU, a Latin-1 language, HE/AR • Run these core-language automated tests frequently; run against the full set of languages less frequently • Leave humans for testing that requires analytic skills and thinking
Thank You! Q&A