Andrew Cunningham
Information Systems Librarian
Maribyrnong Library Services
andjc@ozemail.com.au
Community Networking Conference 1999 : Engaging Regionalism
29 September - 1 October, 1999
Ballarat University, Victoria, Australia
Unicode is a multilingual and multiscript character set. The use of Unicode makes it possible to create multilingual web pages in community languages. Multilingual public internet access provides the opportunity to create a new paradigm for LOTE Collection Development and LOTE service provision.
The Maribyrnong Library Services web site was initially a trilingual web site with separate HTML documents in English, Vietnamese and Chinese (Traditional). Unicode web pages have now been developed. This has allowed us to create HTML documents that contain English, Vietnamese and Chinese. Recently Amharic text has been added, and additional languages will be included in the future.
This paper will explore the potential use of Unicode in public library web sites, the current limitations hindering the use and deployment of Unicode HTML documents, and the strategies available in the HTML4 and Cascading Style Sheets specifications for overcoming some of these limitations.
Each day public libraries deal with requests for information in languages other than English (LOTE). In order to cater for the needs of their clients, many public libraries collect a range of materials in other languages. The ability of public libraries to collect in other languages is constrained by finances and availability of non-English language materials.
The general trend is for libraries to cater for the larger language groups in their community, if materials can be purchased for that language group. The advent of public internet access in public libraries has begun to alter the way libraries address the provision of multicultural library services.
From the beginning, public internet access has been multilingual in Maribyrnong. Initially, we concentrated on the provision of multilingual internet access to the Vietnamese and Chinese communities. The Vietnamese and Chinese languages are the largest of our community languages. Vietnamese and Chinese fonts and input method editors were installed on all public internet workstations.
Maribyrnong Library Service's web site (http://library.maribyrnong.vic.gov.au/) was trilingual. Each part of the site was translated from English into Traditional Chinese and Vietnamese. The web pages were monolingual, but it was possible to click on an icon on any page and move to the corresponding pages in the other two languages. The web site contained information about the facilities and services offered by the library service. The English component used the ISO-8859-1 character set, while Chinese used Big5 and Vietnamese used VISCII.
Shortly afterwards, our focus changed. We continued to expand internet access for the languages we already catered for, but we also realised that the internet provided an opportunity to serve language groups that fell below our criteria for collecting. The internet allowed us to cater for the emerging languages in our community.
The first shift in the development of the web site was functional. Most people who use public internet access in our library use external sites, rather than pages on the library's web site.
It was decided to develop a web page (http://library.maribyrnong.vic.gov.au/utf8/) for these users that provided a jumping-off point into the internet. The page contained the mission statement of the library service and links to search engines and to InfoWest (http://infowest.maribyrnong.vic.gov.au/), a community information network developed by the City of Maribyrnong. The first version of the page was trilingual, containing sections in English, Vietnamese and Chinese.
In order to provide these languages on the same page, it was necessary to use Unicode. Unicode (http://www.unicode.org/) is a multiscript character set allowing the use of many of the languages of the world.
The web site used HTML 4.0 and Cascading Style Sheets Level 1 (CSS1). Where possible, the WAI web accessibility guidelines were followed. The use of HTML4, CSS1 and Unicode meant that Internet Explorer 4 or Netscape Navigator 4 was required to view those pages as designed.
The Unicode page was designed to be used within the library, where we had full control over the software and computers being used to view the web page. This allowed us to specify which fonts were to be used for different sub-sections of the Unicode repertoire. The Unicode character set is so large that very few type foundries produce complete Unicode fonts. Most Unicode fonts contain subsets of the Unicode character set. It is therefore sometimes necessary to use several fonts in order to display multiple languages.
The second stage of the Unicode pages was to develop resources for the emerging language groups in our community. Within the past two years there has been a growing number of immigrants from the Horn of Africa. There is a low level of English literacy among these immigrants and the availability of resources in their languages is negligible. Library staff were able to locate materials in the Amharic language on the internet. Amharic is the national language of Ethiopia and is also spoken in other parts of the Horn of Africa.
A freely available Unicode font that supported the Ethiopic script was located. The library's mission statement and key phrases were then translated into Amharic. The Amharic text was added to the Unicode page and a second web page listing Amharic resources was created (http://library.maribyrnong.vic.gov.au/utf8/et.html). A set of links to English news sites about Ethiopia was also added to this web page. This page of links to Amharic and English news resources is popular with our Ethiopian patrons.
Next we added Arabic to the primary Unicode page, with a link to The Open Road's (http://www.openroad.vic.gov.au/) Arabic links page. Recently a Somali section was added to the primary Unicode page. This is linked to a page with links to Somali news resources in Somali and English.
There are many good resources about web site design, so I will not go into any great detail. I'll briefly discuss a couple of aspects that relate to the design of multilingual sites.
It is very important to clearly identify the purpose of the site before you begin to design it. The purpose of the site and the audience the site is intended for will shape the design of the site.
It is very important to identify the audience you wish to create the site for. You need to "identify the potential users of your Web site, so that you can structure the site design to meet their needs and expectations" (Lynch & Horton, 1997b).
What languages do you want to provide information in? Is the material going to be accessed directly by the user, or will access be mediated by a caseworker, a health professional or someone similar? Will the entire site be translated, or will only key documents be translated?
The answers to these questions depend on who you intend the resources to be used by. If you intend the resources to be directly accessed by clients from a non-English speaking background (NESB), what is their English literacy level?
It is also important to take into consideration the cultural and political dimension of your potential users. When we were developing the Amharic news resources page we decided that the page should contain two sets of links. The first is a set of links to news resources in the Amharic language. Most of the news stories carried by these sites were written in Ethiopia. The second set of links was to non-Ethiopian English language news sites. We found that many of our borrowers wanted to be able to read news in their own language and at the same time obtain an independent view of the events affecting the Horn of Africa.
Computers treat characters as numbers. A character set provides a mapping between each character and a number (its code point). Some languages have many character sets to choose from. The encoding is the way those code points are represented as a sequence of bytes. For most character sets there is a one-to-one relationship between the character set and its encoding, which is why many books and web resources speak of character sets and encodings interchangeably, even though they are two different things.
The distinction becomes important when working with Unicode or Japanese. Unicode is a character set, and there are a number of different encodings that represent Unicode characters as a sequence of bytes. The main encoding used for Unicode on the web is UTF-8. When you are creating a Unicode web page with Unicode-aware software, it is important to check that the software can save the document as a UTF-8 encoded document rather than in some other encoding.
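The difference between a character set and an encoding can be seen in a few lines of code. The following is a minimal sketch in Python (chosen here purely for illustration; it is not part of the library's site), using the Vietnamese word "Việt" as sample text. The variable names are hypothetical.

```python
# Sample text: the Vietnamese word "Việt". The escape \u1ec7 is the
# precomposed letter e with circumflex and dot below (U+1EC7).
text = "Vi\u1ec7t"

# The character set assigns each character a number (its code point).
code_points = [f"U+{ord(ch):04X}" for ch in text]

# An encoding turns those code points into bytes. The same four
# characters produce different byte sequences in different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")

print(code_points)        # ['U+0056', 'U+0069', 'U+1EC7', 'U+0074']
print(utf8_bytes.hex())   # 5669e1bb8774
print(utf16_bytes.hex())  # 005600691ec70074
```

The code points are identical in both cases; only the bytes differ. A browser told to expect UTF-8 will not display the second byte sequence correctly.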
Once you have chosen the languages you wish to provide resources in, it is necessary to decide what character set you wish to use. It is important that the character set you choose to use is widely used by your target audience or that it is freely available for download. The World Wide Web Consortium has a list of some languages and the character sets commonly used: http://www.w3.org/International/O-charset-lang.html.
On community sites I'd restrict myself to font solutions that are available for free from the internet. If you use a solution that requires your target audience to obtain commercial software, then it's not likely that your web site will be used.
Wherever possible, language issues should be transparent.
It is common practice among many non-English web sites to provide a link to software and fonts that can be downloaded for free to help visitors access the site. Some sites also provide instructions. Sometimes these instructions are in English; sometimes they are in the site's own language, in which case they are usually a set of images of text rather than actual text.
I've found the best way to prepare English text for translation is to write the text in plain English. I intend to include clarification notes on any material that may be confusing to, or may be misunderstood by, the translator.
When preparing the English text to be translated into Amharic for the Maribyrnong site, I used the expression "Ethiopian resources". The translator interpreted it to mean "Ethiopian economic resources". This was a valid translation, but it was not what I meant. The intent of my poorly worded English version was to say "Ethiopian information resources". The term I used was ambiguous jargon. Eventually, we settled for an Amharic translation of "Ethiopian news", which is closer to what the web page was supposed to provide.
Organisations need guidelines for preparing text for translation. It is best to avoid technical or professional jargon where possible, or if necessary, provide a glossary of terms.
I also find it useful to get someone to proofread the translation, in order to ensure that the translation is appropriate and accurate. Unfortunately it's not always possible to do this.
One major problem with obtaining translations is the format the translation is received in. In the past I've always received translations in printed form rather than electronic form. I've been lucky to have people on hand to type in the translations. It would be preferable to receive a translation as a text file, but this requires that you and the translator have software capable of supporting the same character sets and encodings. Beware of problems that may occur if the file was generated on a different operating system.
When designing pages, beware that text size and direction will vary in translation (Yergeau & Dürst, 1999), affecting:
It is also important to remember that it is not only text that needs to be translated. If you have graphs or images that include text then that text needs to be translated and the graphs and images modified to include the translated text. Translating text in images may increase the size of the images.
When translating material for a web site it is necessary to remember that there are other parts of the web site that may require translation services. CGI scripts, JavaScript, Java applets and ActiveX controls may contain strings that need to be internationalised, translated or localised.
Another area is multimedia. If the site uses sound or video, it may be necessary to get the material dubbed or localised.
Current literature, such as Bishop (1998) and Rockwell (1998), is aimed more at the business and marketing aspects of the global web, i.e. site internationalisation, translation or localisation, rather than at multilingual sites.
Internationalisation, as it seems to be used, refers to the process of making a web site culture independent or culture neutral. It usually doesn't refer to translating the site into another language. Most internationalised sites remain in English, and culture specific references are removed in order to make the site more accessible for people from other countries.
Localisation refers to the adaptation of the contents to the audience's culture(s). This is not just a case of translating the text into the target language. The web site is optimised for the target culture.
There are two factors influencing the architecture of the web site: access mechanism and degree of multilingualism.
There are three types of access mechanisms for web sites that provide multilingual resources.
Multilingual web sites can be divided into three categories:
Site navigation is a major issue with multilingual sites. Yergeau & Dürst (1999) identified three common navigation aides on multilingual sites:
Specifying the language of a document in the filename is a useful aid to site administration. It allows quick and easy differentiation between different language versions of the same document. Yergeau & Dürst (1999) suggest inserting the two-letter ISO language code between the filename and the file type extension, e.g. index.am.html, index.fr.html or index.ru.html. Such a naming scheme can, however, cause problems when the site is developed on Windows machines.
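The naming convention suggested by Yergeau & Dürst can be sketched in Python as a pair of helper functions. This is only an illustration under my own assumptions: the function names are hypothetical, and the check for a two-letter code is a simple heuristic rather than a lookup against the ISO 639 list.

```python
from pathlib import PurePosixPath


def language_variant(filename, lang):
    """Insert a language code before the extension: index.html -> index.am.html."""
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{lang}{path.suffix}"))


def language_of(filename):
    """Return the two-letter code embedded in a filename, or None if absent."""
    suffixes = PurePosixPath(filename).suffixes
    if len(suffixes) >= 2:
        code = suffixes[-2].lstrip(".")
        # Heuristic only: accept any two ASCII letters as a language code.
        if len(code) == 2 and code.isascii() and code.isalpha():
            return code.lower()
    return None


print(language_variant("index.html", "am"))  # index.am.html
print(language_of("index.fr.html"))          # fr
print(language_of("index.html"))             # None
```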
It is possible to set the options/preferences in Netscape Navigator 4 and Internet Explorer 4 & 5 to specify the preferred language you would like to receive a document in if there are different language versions. Some web sites can be configured to handle language negotiation, so that when a browser requests a file, the preferred language version of the file is sent to the browser.
Most web servers that support language negotiation require a specific naming convention or site structure convention.
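Browsers communicate the preferred languages to the server in the Accept-Language request header. The following Python sketch shows, in simplified form, how a server-side script might choose between language versions. It is not the algorithm of any particular web server; the function names and the fallback to the primary language subtag are my own assumptions.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (tag, quality) pairs, best first."""
    preferences = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without a q parameter has quality 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        preferences.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep the order of the header.
    return sorted(preferences, key=lambda pair: -pair[1])


def choose_language(header, available, default="en"):
    """Pick the best available language version for a request."""
    lookup = {tag.lower(): tag for tag in available}
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in lookup:
            return lookup[tag]
        primary = tag.split("-")[0]  # assumed fallback: zh-hk -> zh
        if primary in lookup:
            return lookup[primary]
    return default


print(choose_language("vi, en;q=0.8", ["en", "vi", "zh"]))  # vi
```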
Do not translate file names. It makes web site administration easier if all the filenames are in the same language.
It is best to adhere to standards when designing a site. The site should be universal and work on any browser that can support the languages and encodings you are working with. You will need to decide which version of HTML you will use as your base. HTML 4 is the better choice for non-English web pages, because it provides the lang and dir attributes and uses ISO-10646-1 as its document character set.
The site shouldn't be designed for a particular screen size or resolution. Remember that there are many smaller monitors being used to surf the net.
It is usually best not to specify fonts, but there are exceptions. The target audience for Maribyrnong's Unicode pages is internal users, which allows us to keep tight control over the configuration of the workstations.
Since most Unicode fonts are subsets, it may be necessary to specify fonts for particular scripts.
HTML 4 allows the specification of the language being used. Many elements can have a lang attribute. The lang attribute specifies the language that the contents of the element are written in.
The value for the lang attribute is based on the syntax outlined in RFC 1766 (ftp://ftp.nordu.net/rfc/rfc1766.txt). The value is made up of a two-letter language code from ISO 639 (http://www.sil.org/sgml/iso639a.html), optionally followed by a hyphen and a country or dialect subtag. The two-letter country codes are based on ISO 3166 (http://www.w3.org/international/O-misc-iso3166.html). You can also use IANA codes. E.g. English: en; American English: en-US; British English: en-GB; Chinese: zh; Hong Kong Chinese: zh-HK; French: fr.
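A common slip is to write the subtag separator as an underscore rather than a hyphen. The Python sketch below checks a value against the basic RFC 1766 syntax (a primary tag of one to eight letters, followed by optional hyphen-separated subtags of one to eight letters). It checks the syntax only and does not confirm that the codes are registered; the function name is hypothetical.

```python
import re

# RFC 1766 syntax: primary tag of 1-8 letters, then optional subtags
# of 1-8 letters, each introduced by a hyphen.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z]{1,8})*")


def is_valid_lang(value):
    """Return True if value is syntactically valid for the lang attribute."""
    return LANGUAGE_TAG.fullmatch(value) is not None


print(is_valid_lang("en"))     # True
print(is_valid_lang("zh-HK"))  # True
print(is_valid_lang("en_US"))  # False: underscore instead of hyphen
```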
If the entire document is in English you could put the lang attribute in the html element:
<html lang="en">
or it could be placed on other elements.
<p lang="en"> a lot of English text
<span lang="fr"> some French text</span>
English text continues</p>
A W3C note, Primary Language in HTML (http://www.w3.org/TR/NOTE-html-lan), outlined the use of the meta element to specify the language of the document. The meta element would look like
<meta http-equiv="Content-language" content="en">
A number of languages could be specified by
<meta http-equiv="Content-language" content="en,ar,am">
The value of the language would be the same as what would be given to the lang attribute if the lang attribute was used.
The HTML 4 specification allows you to specify the character set of the html document. The character set is specified by a meta element.
<meta http-equiv="content-type" content="text/html; charset=utf-8">
or
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2">
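The charset declared in the meta element must match the encoding the file was actually saved in. The Python sketch below is one rough way to check this: it pulls the declared charset out of the raw bytes with a regular expression and then tries to decode the file with it. The function names are hypothetical, and a successful decode is only a necessary condition; some wrong declarations (for example one 8-bit character set labelled as another) will decode without error and still display the wrong characters.

```python
import re

# Find charset=... in the raw bytes of a document, with or without quotes.
CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def declared_charset(raw):
    """Return the charset named in the document, lower-cased, or None."""
    match = CHARSET.search(raw)
    return match.group(1).decode("ascii").lower() if match else None


def decodes_as_declared(raw):
    """Return True if the bytes can be decoded using the declared charset."""
    charset = declared_charset(raw)
    if charset is None:
        return False
    try:
        raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return False
    return True


page = ('<meta http-equiv="content-type" '
        'content="text/html; charset=utf-8"><p>caf\u00e9</p>')
print(decodes_as_declared(page.encode("utf-8")))       # True
print(decodes_as_declared(page.encode("iso-8859-1")))  # False
```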
It is possible to specify the directionality of the text. The dir attribute takes a value of 'ltr' (left to right) or 'rtl' (right to left). The default is 'ltr'. It is important to include the attribute if you are including text in the Arabic or Hebrew scripts.
<body lang="en" dir="ltr">
<p>A paragraph of English text</p>
<p lang="ar" dir="rtl">A paragraph of Arabic text</p>
<p>A further paragraph or two of English text
<q lang="ar" dir="rtl">An Arabic quotation</q></p>
</body>
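When pages are generated by a script, the value of the dir attribute can be guessed from the text itself. The Python sketch below uses a "first strong character" heuristic based on the bidirectional classes in the Unicode character database. This is my own illustration rather than anything required by HTML 4, the function name is hypothetical, and the result should still be checked by someone who reads the language.

```python
import unicodedata


def guess_dir(text):
    """Guess 'ltr' or 'rtl' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return "rtl"
        if bidi_class == "L":          # left-to-right letter
            return "ltr"
    return "ltr"  # the HTML default when nothing decisive is found


print(guess_dir("English text"))          # ltr
print(guess_dir("\u0627\u0644\u0639\u0631\u0628\u064a\u0629"))  # rtl (Arabic)
```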
The bdo element overrides the default bidirectional algorithm for the text it contains, forcing the display direction given by its dir attribute.
<p>an English paragraph</p>
<p><bdo lang="he" dir="rtl">Some Hebrew text</bdo></p>
<p>This is another English paragraph</p>
It is useful when you need to force a change in the display direction of text.
I usually restrict myself to using CSS1, since CSS2 is yet to be fully implemented in web browsers. In multilingual web pages its main use is in Unicode pages, where you need to specify the fonts used to display text.
For example a web page in Amharic and English may have the following style:
<style>
body {
background-color: white;
color: black;
font-family: Verdana,Tahoma,Bitstream CyberBase,Arial;
}
.am {
font-family: 'GF Zemen Unicode';
}
</style>
And the body of the web page may look like:
<body>
<h1>An English heading</h1>
<h2>An English sub heading</h2>
<p>An English paragraph</p>
<h2 class="am">An Amharic sub heading</h2>
<p class="am"> an Amharic paragraph</p>
</body>
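If pages are assembled by a script, the class attribute can be chosen from the script of the text. The Python sketch below is a rough heuristic that looks at Unicode character names; the mapping from script to class name mirrors the .am class above and an assumed .ar class for Arabic, and the function name is hypothetical. Chinese is left out because Traditional and Simplified text cannot be told apart reliably from individual characters.

```python
import unicodedata

# Assumed mapping from the start of a Unicode character name to a CSS class.
SCRIPT_CLASSES = {"ETHIOPIC": "am", "ARABIC": "ar"}


def css_class_for(text):
    """Return the class for the first character in a known script, else None."""
    for ch in text:
        name = unicodedata.name(ch, "")  # "" for characters without a name
        for prefix, css_class in SCRIPT_CLASSES.items():
            if name.startswith(prefix):
                return css_class
    return None  # the body font is assumed to cover everything else


print(css_class_for("\u12a0\u121b\u122d\u129b"))  # am (Amharic text)
print(css_class_for("An English paragraph"))      # None
```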
Style sheets are also useful for controlling the line height of Chinese text; otherwise consecutive lines of Chinese text nearly run into each other.
<style>
.zh {
font-family: 'UWCXMF (Big5)',MingLiU,MSung-Big5;
line-height: 120%;
}
</style>
In HTML 3.2 and earlier versions of HTML, the document character set that HTML used was ISO-8859-1 (Latin1). It was possible to specify a character in the ISO-8859-1 character set by numerical reference to its code-point in the character set.
HTML 4.0 uses ISO-10646-1 as the document character set. The basic multilingual plane of ISO-10646-1 is a codepoint for codepoint equivalent of Unicode. All numerical references to a character are assumed to be to that codepoint in the ISO-10646-1 character set.
The HTML 4.0 recommendation allows the user to specify the code point using either a hexadecimal or a decimal numerical reference, although the main web browsers only seem to recognise character references using decimal numbers.
A numeric character reference (numeric entity) is written as &# followed by the number of the desired ISO-10646-1 code point and a semicolon. E.g. &#918; is the Greek capital letter Zeta (Ζ), and &#8230; is a horizontal ellipsis (…).
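Decimal references can be generated mechanically from text. The Python sketch below replaces every non-ASCII character with its decimal numeric character reference and uses the standard html module to confirm that the decimal and hexadecimal forms refer to the same character. The function name is hypothetical, and the sketch deliberately leaves ASCII characters such as & and < untouched.

```python
import html


def to_decimal_refs(text):
    """Replace each non-ASCII character with a decimal reference (&#N;)."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


zeta = "\u0396"  # Greek capital letter Zeta
print(to_decimal_refs(zeta))          # &#918;
print(to_decimal_refs("wait\u2026"))  # wait&#8230;

# Decimal 918 and hexadecimal 396 name the same code point.
print(html.unescape("&#918;") == html.unescape("&#x396;") == zeta)  # True
```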
The Maribyrnong Library Service uses a range of software to create multilingual web pages. There is a range of software available which is useful for editing Unicode web pages:
A possible global style sheet could look like:
body {
background-color: white;
color: black;
font-family: Verdana,Tahoma,Bitstream CyberBase,Arial;
}
The stylesheet for the body specifies a set of fonts that between them contain all the characters required for English, Somali and Vietnamese. It is therefore not necessary to create a special class to specify fonts for Vietnamese and Somali.
a {
color: green;
text-decoration: none;
}
p {
text-align: justify;
margin-left: 1.5cm;
margin-right: 1.5cm
}
p.centre {
text-align: center;
line-height: 120%
}
.am { font-family: 'GF Zemen Unicode'; }
.zht {
font-family: 'UWCXMF (Big5)',MingLiU,MSung-Big5;
line-height: 120%;
}
.zhs {
font-family: 'UWPSTJ (GB)',MS Song,MS Hei,MSung-GB;
line-height: 120%;
}
.ar { font-family: Traditional Arabic,Arabic Transparent,Bitstream CyberBase; }
This set of classes enables the specification of the appropriate Unicode fonts to display the script in question. The classes support the following scripts and languages:
span { font-family: Verdana,Tahoma,Bitstream CyberBase,Arial; }
The span element was included in the style sheet so that when an English word or phrase occurs in non-Latin text, it is possible to specify a font just for that word or phrase.
h1 { text-align: center; color: black; }
h2 { text-align: center; color: black; }
The HTML document would contain a link to the style sheet.
A possible HTML document:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">
<html>
<head lang="en" dir="ltr">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta http-equiv="content-language" content="en,zh-HK,vi,am,so,ar">
<title>A sample page</title>
<meta name="keywords" content="an appropriate list of words">
<meta name="Author" content="Jane Doe">
<link rel="stylesheet" type="text/css" href="multiling.css">
</head>
<body lang="en" dir="ltr">
<h1>English Title</h1>
<p>Lots of English text
<br><a href="whatever url" charset="iso-8859-1"> English description</a></p>
<h1 class="am" lang="am" dir="ltr">Amharic Title</h1>
<p class="am" lang="am" dir="ltr">Lots of Amharic text
<br><a href="whatever url" charset="x-etascii">Amharic description</a></p>
<h1 class="ar" lang="ar" dir="rtl">Arabic Title</h1>
<p class="ar" lang="ar" dir="rtl">Lots of Arabic text
<span> a few English words </span> more Arabic text
<br><a href="whatever url" charset="iso-8859-6">Arabic description</a></p>
</body>
</html>
Bishop, Mark (1998). How to build a successful international web site. Scottsdale, AZ : Coriolis Group.
Doherty, Will (1999). "Creating multilingual web sites". Multilingual computing and technology. 10 (3), p.34-7.
Lieu, Tina (1999). "Unicode 3.0 arrives". Multilingual computing and technology. 10 (5), p.50-2.
Lynch, Patrick J. and Horton, Sarah (1997). CAIM/Yale web style guide. [http://info.med.yale.edu/caim/manual/]. Viewed: 18th August 1999.
Rockwell, Browning (1998). Using the web to compete in a global marketplace. New York, NY : John Wiley & Sons.
Sheridan, E. F. & Simons, George F. (1998). Going global online : monitoring your cultural presence in cyberspace. The Web of Culture. [http://www.webofculture.com/home/analysis.html]. Viewed: 30th July 1999.
Technical aspects of web translation (1999). AvantPage. [http://www.avantpage.com/web-technical.html]. Viewed: 3rd August 1999.
Vehovar, Vasja et al (1999). "Language as a barrier". INET'99 Internet Global Summit. San Jose, CA : ISOC. [http://www.isoc.org/inet99/proceedings/3i/3i_3.htm]. Viewed: 1st August 1999.
Yergeau, François & Dürst, Martin (1999). "Weaving the multilingual web". 15th International Unicode Conference. San Jose, CA : Unicode Consortium. [http://www.w3.org/Talks/1999/0830-tutorial-unicode-mjd/]. Viewed: 31st August 1999.