Data-driven and Hybrid Approaches to Machine Translation

This is the course page for the course "Data-driven and Hybrid Approaches to Machine Translation", taught by Martin Forst and Alexander Fraser at the DGfS-CL Fall School at the University of Konstanz, September 7-19, 2009.

Day 1 Slides

Day 1 Slides

Day 1 Reading

Reading: Kevin Knight's Tutorial on Probability, IBM Model 1 and IBM Model 3

If you are not familiar with probability, be sure to read the beginning of the tutorial, which covers the necessary background. If you have time, try to read the parts about IBM Model 1, which we will cover in detail in the lecture on Day 2.

Assignment 1

Send your solution to me by email: fraser AT ims.uni-stuttgart.de

PLEASE PUT THIS IN THE SUBJECT FIELD OF THE EMAIL (substituting your last name)!!! "konstanz yourlastname assignment 1"

PART I (only if you have Java):

Check whether you have a Java runtime by opening a command shell on your computer and typing "java" (without quotes). If a usage message appears rather than a "command not found" error, Java is installed. If you have Java, download one of these zip archives (pick a language you can read, preferably German or French if you feel you can understand short sentences):

German to English

French to English

English to English (English paraphrasing)

Unzip the archive into a new directory, and read the README file.

Run the alignment viewer (see README).
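The setup steps above (checking for Java and unpacking the archive) could also be scripted. The following is only a minimal sketch in Python: the archive name `alignment-data.zip` and the target directory are hypothetical placeholders, not the actual names of the course downloads, and the helper names `java_available` and `unpack` are made up for illustration.

```python
import shutil
import zipfile
from pathlib import Path
from typing import List


def java_available() -> bool:
    """Return True if a `java` executable is found on the PATH."""
    return shutil.which("java") is not None


def unpack(archive: str, target_dir: str) -> List[str]:
    """Unzip `archive` into `target_dir` (created if missing).

    Returns the names of the entries in the archive, so you can
    check that a README is among them.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        return zf.namelist()


if __name__ == "__main__":
    print("java found:", java_available())
    archive = "alignment-data.zip"  # hypothetical file name; use the archive you downloaded
    if Path(archive).exists():
        print(unpack(archive, "alignment-data"))
```

After unpacking, the README in the new directory remains the authoritative source for how to start the alignment viewer.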

1) Look at the annotated sentences. (If you are using the English paraphrases, think of the sentence on the right as being in a foreign language.) Pick two sentences where you are unsure whether you agree with an annotation decision, and say which sentences these are and why you question the decision.

2) Are there annotation decisions you agree with which might hurt a machine translation system if it used them in a different context? Give an example.

3) Pick three sentences which are unannotated. Annotate them. Which decisions did you find difficult?

4) Send me the file align.out so that I can look at your annotations.


PART II:

Put five sentences from your native language (around 8 to 10 words in length) through Google Translate (one sentence at a time). Translate them to English (if your native language is English, use a different source language). Try to pick sentences such that Google Translate translates all the words, i.e., Google Translate does not leave words in the source language untranslated in the English output. Please say which language this is and provide:

1) the input sentence

2) the output sentence

3) a correction of the output sentence

4) your comments on what went wrong and why you think this happened

Be sure to provide at least one sentence which Google Translate correctly translates, and at least one sentence which Google Translate translates incorrectly.

Day 2 Slides

Day 2 Slides

Day 2 Reading

See Day 1 reading.

Assignment 2

Strongly preferred: Program IBM Model 1

OR

Not strongly preferred: Answer some questions about IBM Model 1

NEW NOTE: Regardless of which homework assignment you do, you may want to start by convincing yourself that the very simple estimation performed by running the main loop of the pseudo-code once gives the same results as explicitly enumerating the alignments on slide 36 (the slide where we calculated counts by working through the four alignment functions one by one). To do this, start with the t values on slide 36 and apply them to just the pair of two-word sentences on that slide.
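The check described in this note can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the pseudo-code from the slides: the sentence pair and the t values are made-up placeholders (substitute the values from slide 36), `t[(f, e)]` is assumed to stand for t(f|e), and no NULL word is used, which is consistent with a two-word sentence pair having exactly four alignment functions.

```python
from collections import defaultdict
from itertools import product


def em_counts_fast(f_sent, e_sent, t):
    """One pass of a Model 1 count-collection loop (no enumeration of alignments)."""
    counts = defaultdict(float)
    for f in f_sent:
        # Normalizer: total probability of generating f from any word in e_sent.
        z = sum(t[(f, e)] for e in e_sent)
        for e in e_sent:
            counts[(f, e)] += t[(f, e)] / z
    return counts


def em_counts_enumerated(f_sent, e_sent, t):
    """The same counts, computed by listing every alignment function explicitly."""
    counts = defaultdict(float)
    # An alignment assigns each position j in f_sent a position i in e_sent.
    alignments = list(product(range(len(e_sent)), repeat=len(f_sent)))

    def score(a):
        # Unnormalized probability of one alignment: product of its t values.
        p = 1.0
        for j, i in enumerate(a):
            p *= t[(f_sent[j], e_sent[i])]
        return p

    total = sum(score(a) for a in alignments)
    for a in alignments:
        posterior = score(a) / total  # P(a | e, f)
        for j, i in enumerate(a):
            counts[(f_sent[j], e_sent[i])] += posterior
    return counts


# Hypothetical example data; NOT the values from slide 36.
f_sent = ["das", "Haus"]
e_sent = ["the", "house"]
t = {("das", "the"): 0.7, ("Haus", "the"): 0.3,
     ("das", "house"): 0.4, ("Haus", "house"): 0.6}

fast = em_counts_fast(f_sent, e_sent, t)
slow = em_counts_enumerated(f_sent, e_sent, t)
for pair in sorted(fast):
    print(pair, round(fast[pair], 6), round(slow[pair], 6))
```

The two functions agree because the sum over all alignments of a product of t values factorizes into a product of per-word sums, which is what lets the fast loop skip the enumeration.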

Day 3 Slides

Day 3 Slides

Assignment 3

Build a Moses system

Day 4 Slides

Day 4 Slides

Day 5 Slides

Day 5 Slides

Day 6 Slides

Day 6 Slides

Assignment 4

(LARGELY STRING-BASED) ONLINE MT SYSTEMS

There are a number of machine translation systems that you can use online. The best-known are Yahoo’s Babelfish (http://babelfish.yahoo.com/) and Google’s online translation system (http://translate.google.com). Below are three texts. Please test both systems mentioned above by automatically translating these texts into German, French, Spanish, or another available language that you know enough about to judge the output. (If you choose a language other than German, French, or Spanish, please be very specific in your explanations, since I won't be able to judge the output myself.) Then answer the following questions:

1. Which of the systems comes closest to a correct translation for each text?

2. What kinds of errors do you observe and what might they be due to? Which ones could be avoided if the systems used (i) a perfect part-of-speech tagger, (ii) a perfect deep parser (e.g. XLE with a ParGram grammar or the English Resource Grammar (HPSG)), or (iii) a perfect system for anaphora resolution?

3. Can you find translations that are instances of the translation phenomena that we talked about in the course (head switching, argument switching, etc.)? How well do the systems handle these? Conjecture why the systems have problems with these, or how they might resolve potential problems. Are there bad translations that should have been translated by head switching, argument switching, etc., but were not? If so, which ones are these?

a. Newspaper text (Spiegel Online, September 8, 2009)

German Chancellor Angela Merkel is being criticized for running a boring election campaign. It may be part of a cunning plan to win by deterring opposition supporters from voting. The strategy could work – but may end up damaging German democracy.

b. Novel (Crime and Punishment by Fyodor Dostoyevsky, beginning of Chapter 2)

Raskolnikov was not used to crowds, and, as we said before, he avoided society of every sort, especially of late. But now all at once he felt a desire to be with other people. Something new seemed to be taking place within him, and with it he felt a sort of thirst for company. He was so weary after a whole month of concentrated wretchedness and gloomy excitement that he longed to rest, if only for a moment, in some other world, whatever it might be; and, in spite of the filthiness of the surroundings, he was glad now to stay in the tavern.

c. Manual (iPhone User Guide)

You can use iPhone to make calls in many countries around the world. You must first enable your carrier's service plan for international roaming. So that you can still make calls by tapping entries in contacts or favorites, you can set iPhone to add your country prefix automatically to phone numbers when you're calling from another country. When you're traveling outside your carrier's network, you may be able to choose among different carriers in the area where you're traveling.

Please submit your answers electronically to martin.forst AT microsoft.com, putting "konstanz yourlastname assignment 4" in the subject field of the email (substituting your last name)!!! Please include the translations you obtained from the online systems. Also, don’t hesitate to contact me if you have problems with the assignment.


Day 7 Slides

Day 7 Slides

Day 7 Reading

Reading: NAACL/HLT 2006 paper on "Grammatical Machine Translation" by Stefan Riezler and John Maxwell

Assignment 5

MANUAL TRANSFER RULE DEVELOPMENT -- German-to-English MT

I suggest that you work on this assignment in pairs. All participants with a sound knowledge of German are asked to make sure that the participants who do not know German have a German-speaking partner. Conversely, participants without sufficient knowledge of German are encouraged to let the German speakers know that they need a partner.

Before starting the exercises themselves, please download german-mt.tgz into the private directory that you use for this course and unpack it. You will obtain a subdirectory german-mt. Please create an alias called xlerc for the file xlerc_manual in this directory. Then open an Emacs window and start a shell in it by entering `M-x shell' (i.e. `Esc-x shell' on most keyboards).

EXERCISE 1 -- Writing transfer rules manually

The transfer rules in mt-manual-rules.pl can translate sentence 1 in mt-sentences.txt. Please try this out by starting xle from the directory where you put your copy of xlerc_manual (to start xle, simply type "xle" at the prompt of your Emacs shell and hit Return) and entering the following command:

 xfr-testfile mt-sentences-leftside.txt 1

This command will produce the four usual windows plus one additional window with an English f-structure that is the result of the transfer of the German f-structure. You will get the final translation by selecting `Generate from this f-structure' in the `Commands' menu of this additional window.

Note: Please ignore the warnings complaining about the library libxle-lm. We cannot use it on all operating systems at the moment, and we don't really need it either.

Note: Email me if these instructions do not allow you to parse the German sentence, transfer it into an English f-structure and generate from it. The transfer system depends on a number of settings, which should normally all be set correctly, but a single wrong setting is enough to keep the system from working.

Once you can translate sentence 1, your task is to add transfer rules to your copy of mt-manual-rules.pl so that all the other sentences also receive the translations given in mt-sentences.txt. Note: You can use the predefined macros and templates given in mt-rules.pl; they are designed to make your life easier. This is particularly true for the transfer of sentence 6. However, in this exercise, it is up to you whether you use these mechanisms or not; the first goal is to get the right translations. Also, it is irrelevant for now whether the system ranks the target string as the most probable candidate or whether it is merely among the other candidates.

As you extend the transfer rules, you will want to test them to see whether they actually improve coverage in the way you intend. To do this, you'll need to make sure your extensions are saved and then reload the transfer rules. It is not necessary to restart XLE. The command for reloading the transfer rules is:

 reload-transfer-rules 


EXERCISE 2 -- Transfer rule templates and rule union

If you haven't done so already in Exercise 1, go through the templates defined at the top of mt-manual-rules.pl and try to understand how they work. Then rewrite the rules you added in Exercise 1 so that you use at least three different templates. (Rules that consist of a simple template call do count, of course.)

Try to use one of the templates for the transfer of the German verb "schmecken" into the English verb "like". Also, try to restrict the rule in such a way that no progressive forms of "like" are generated; in other words, introduce (a) fact(s) that set(s) the feature TNS-ASP PROG of "like" to -_. Note that you will have to use rule union in order to combine the template with the additional restriction. If you are unsure about how rule union works, refer to the rule for the transfer of the German verb "ab#statten" into the English verb "pay" when it has an OBJ "Besuch"/"visit", since this rule involves both a template and rule union.

Please submit your answers electronically to martin.forst AT microsoft.com, putting "konstanz yourlastname assignment 5" in the subject field of the email (substituting your last name)!!! Please attach your copy of mt-manual-rules.pl after renaming it yourlastname-mt-manual-rules.pl. And as always, don’t hesitate to contact me if you have problems with the assignment.

Day 8 Slides

Day 8 Slides

Day 8 Reading

Reading: AMTA 2006 paper on "Context-Based Machine Translation" by Jaime Carbonell et al.

Assignment 6

MT WITH TRANSFER RULE INDUCTION -- German-to-English MT

As with the previous assignment, I suggest that you work on this in pairs. This time, it is imperative that every team includes somebody who speaks German. All German speakers are therefore asked to team up with those who don't speak German in order to make sure everybody can do this assignment. Thank you!

Before starting the exercises, please download german-mt.tgz into the private directory that you use for this course and unpack it. You will obtain a subdirectory german-mt. Please create an alias called xlerc for the file xlerc_extractor in this directory. Then open an Emacs window and start a shell in it by entering `M-x shell' (i.e. `Esc-x shell' on most keyboards).

EXERCISE 1 -- Running the system

Please start xle from the directory where you put your copy of xlerc_extractor. The sequence of commands in xlerc_extractor parses both sides of the phrase pairs in mt-phrases.txt and automatically induces transfer rules from these pairs; these are saved to the file rules.pl. Then the system tries to translate the sentences in mt-sentences-leftside.txt. You will see that it can translate sentences 1 and 2 correctly and sentence 3 almost correctly.

Note: Ignore the warnings complaining about the library libxle-lm. For licensing reasons, it cannot be distributed as part of XLE, and we don't really need it either.

Compare rules.pl to your version of mt-rules.pl from Assignment 5. Can you find systematic differences between the rules you wrote and the ones the system infers? What are these differences?

EXERCISE 2 -- Extending the phrase dictionary

The objective in this exercise is again to cover all seven sentence pairs given in mt-sentences.txt. However, this time we will not define new transfer rules directly, but we will extend the phrase dictionary from which the system induces transfer rules.

Familiarize yourself with the dummy categories used in mt-phrases.txt. They are defined in german-2009-06-29/german-mt.lfg and in english-2009-02-27/english-mt-lex.lfg, where you may also find a few additional dummy categories that are not used in the initial version of mt-phrases.txt, but may be helpful for its extension.

Finally, extend mt-phrases.txt with phrases that you believe to be suitable for the induction of transfer rules that can translate all seven sentences in mt-sentences-leftside.txt. Which of your rules deal with head switching? Which ones deal with argument switching?


Note: For the translation of individual sentences, you can use the following commands in this setup.

 translate {NP1nom schnarcht.}
 translate-testfile mt-sentences-leftside.txt 1
 ...

Please submit your answers electronically to martin.forst AT microsoft.com, putting "konstanz yourlastname assignment 6" in the subject field of the email (substituting your last name)!!! Please attach your copy of mt-phrases.txt after renaming it yourlastname-mt-phrases.txt, as well as the output of your final translation run over mt-sentences.txt, which should be named yourlastname-translations.txt. And as always, don’t hesitate to contact me if you have problems with the assignment.

Day 9 Slides

Feedback concerning Assignment 4

Hands-on work on Assignments 5 and 6.

Day 10 Reading

Paper on Yvette Graham's machinery for the induction of packed transfer rules from aligned f-structures

Pretty detailed paper on Yvette Graham's RIA tool (open source!) for induction of transfer rules from aligned f-structures

Day 10 Slides

Day 10 Slides

Yvette Graham's slides from the LFG 2009 Conference