Cross-lingual information retrieval using compound word splitting

Finding web pages written in a foreign language can be difficult with online search providers, because the information need normally has to be formulated in the same language as the pages. The field of cross-lingual information retrieval seeks to ease this challenge by bridging the gap between information in one language and information demand in another. The literature was found to introduce means of dealing with different aspects of this language gap, and one particular framework was found to combine some of these. This thesis focuses on building a framework that addresses other aspects that are demanding.

The implemented framework comprises query translation and document retrieval. In particular, the handling of compound words is analysed to improve query translation. The approach to compound word splitting aims to be language independent and combines a word length feature with the use of a training corpus. Experiments were conducted to evaluate the splitting of compound words both in isolation, for Norwegian, and in the context of cross-lingual document retrieval with translations to English, Spanish and German.
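
The idea of combining corpus statistics with a word length feature can be illustrated with a minimal sketch. The scoring below is an assumption made for illustration only and is not the formula used in the thesis: the function name split_compound, the toy frequency table TOY_FREQS, the minimum part length and the form of the length weight are all hypothetical. A candidate split is considered only if both parts occur in the training corpus, and its score is the geometric mean of the part frequencies multiplied by a weight that favours splits whose shorter part is long.

```python
from math import sqrt

# Hypothetical toy frequencies standing in for a Norwegian training corpus.
TOY_FREQS = {"bok": 50, "hylle": 10, "hus": 40}

MIN_PART_LEN = 3  # assumed minimum length of a compound part


def split_compound(word, freqs, min_len=MIN_PART_LEN):
    """Return the best two-part split of `word`, or [word] if no split scores higher."""
    word = word.lower()
    best_parts = [word]
    # Score of leaving the word unsplit: its own corpus frequency.
    best_score = float(freqs.get(word, 0))
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        f_head, f_tail = freqs.get(head, 0), freqs.get(tail, 0)
        if f_head == 0 or f_tail == 0:
            continue  # both parts must be attested in the corpus
        # Statistical component: geometric mean of the part frequencies.
        stat = sqrt(f_head * f_tail)
        # Length feature (assumed form): relative length of the shorter part.
        length_weight = min(len(head), len(tail)) / len(word)
        score = stat * length_weight
        if score > best_score:
            best_score, best_parts = score, [head, tail]
    return best_parts
```

With the toy table above, split_compound("bokhylle", TOY_FREQS) yields ["bok", "hylle"], while a word too short to split, such as "hus", is returned unchanged. The sketch handles only two-part compounds and ignores linking elements between parts; both would need separate treatment in a real splitter.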

The experiments found that taking the word length feature into account improved an otherwise purely statistical approach to compound word splitting. It was also found that compound word splitting could be avoided in some cases in cross-lingual document retrieval. Various improvements are possible, such as better tuning of the compound word splitter, detection of rare cases in which compounds are formed, and handling of the problems that occur in document retrieval when a compound word is split.

Publisher

NTNU