Static Analysis for finding I18N Issues

Static analysis tools are useful for quickly identifying issues with your code. Recently, my team has been evaluating Globalyzer, a static analysis tool for finding software internationalization issues. Globalyzer will find such issues as hard coded strings, or hard coded date/time formats. Our aim is to move the discovery of I18N issues from the localization QA phase, upstream to the development phase.

My first impressions are good. Globalyzer is quick to setup, easy to use, and scans code unbelievably quickly. I’ve scanned a project with 350,000 lines of code in less than a minute.

The key to an effective rollout of Globalyzer is developing rule sets per product. Rule sets essentially define the scanning criteria for a code base. For example, for a C++ project, we would start with the built in rule set for C++. Then after an initial scan, we analyze the results, and we may want to update the rule set. For example, we may want to configure the rule set to ignore calls to any logging functions, since we don’t localize log messages.

I’m hoping that we get the following usage from Globalyzer:

  • Globalyzer should help us get involved earlier in the development cycle.
  • Lots has been written about Agile localization. Globalyzer should help us certify sprints as I18N ready, for later localization.
  • We should find any serious I18N issues early in the project – thus reducing fire-fighting later on.

Longer term, I’d like for us to look at integrating Globalyzer with our build systems. I’ll be posting here periodically on our progress, and my experiences with using static analysis tools to find code level internationalization issues.

Basic Software i18n Check List

Part of something my team is currently working on involves defining a standard approach to assessing the world-readiness of a piece of software, be it a brand new product or a legacy product that has existed for a long time.

If your codebase is not correctly internationalized, you will face many problems if you ever want to localize your application for another market.

Most of these are general items that could apply to any product on any platform, be a it a website or a mobile application.

Here’s my first draft of my list. This may help you if you are planning of localizing your application to sell in other markets.

Strings

  • Have all strings been externalized?
  • Have all strings been approved by someone who is a native English speaker?
  • Is there overuse of any abbreviations or acronyms, e.g. ‘App Info’?
  • Is there anywhere that you are concatenating strings together?
  • Punctuation is text too! Are you using any hard-coded punctuation marks (.,!?) these need to be externalized?
  • Have you identified all trademarks, tokens and other ‘special’ elements in your text and communicated to localisation that these shouldn’t be translated?
  • For mobile applications, is your text short enough to fit the potential small screen size of mobile devices? Consider string length expands generally 20% – 30% after translation. Some phrases expand by a lot more than 30%. Abbreviations and acronyms are the worst offenders here.
  • Try to keep resources that shouldn’t be translated separate from those that should.
  • Don’t embed HTML mark up in string resources.

Source Files

  • Source files are a standard file format – no need to invent new resource file formats. A standard resource file format will exist for the platform for which you are developing.
  • Source files are UNICODE.
  • You need to be able to generate your localizable file set in one step, it’s ideally handled by the build process.

General UI

  • Can input fields accept accented characters?
  • Can input fields accept double byte characters?
  • Is there room for buttons to expand if necessary?
  • Are there any UI elements like check boxes or radio buttons that are placed across the screen? It is better to place these vertically.
  • Try to put labels for controls above the control, rather than beside it.

Web Applications

  • Is server set up to send UTF-8 character set?
  • Are all buttons resizable?
  • Are there any mouse-over buttons with text in them?
  • Are other elements on the screen able to expand in size?
  • Are links to help and other user documentation appropriately localized?
  • For HTML based UI, never use absolute positioning.
  • NEVER use HTML tables to achieve layout.
  • Try to avoid using fixed widths with HTML layouts.
  • Don’t use the ‘align’ property to align items left or right.
  • Externalize CSS to a separate file – never use inline CSS.

Images

  • Is there text in any image?
  • Are there any images that depict hand, body or facial gestures? e.g, the thumbs up gesture is offensive in some cultures.
  • Are there any icons or images that are very specific to one region? e.g. an American type mailbox indicating an email.

SQL

  • Use the nvarchar type to store all text.
  • Can all insert and update statements handle unexpected apostrophes? (O’Sullivan, d’accord for example)?
  • Can all insert and update statements handle accented and double byte characters?
  • For MySql, ensure it was installed with UTF-8 as the default character set.

Numbers & Dates

  • Do any parts of your application format numbers using a thousand separator or a decimal point? If so the formatting string must be read from the OS.
  • Do any parts of your application format currencies? If so the formatting string must be read from the OS.
  • Does changing the application locale change all numbers appropriately?
  • Are all time and date formats read from the OS based on locale?
  • Does changing the application locale change all dates appropriately?

Regular Expressions

  • Do any regular expressions use the construct [a-zA-Z]? These are not safe from an i18n point of view.
  • Never use a regular expression to find a price, as the currency will be different for different languages.

Collation/Sorting

  • Do any parts of your application sort alphabetically? If so then they need to be tested for i18n awareness.