Practicing linguistic fieldwork with AI

It’s common practice to train aspiring linguists in field methods by bringing a speaker of an unfamiliar language to class (unfamiliar to the students, of course). Training usually focuses on elicitation methods, which means that the students keep asking questions about the target language to catch a glimpse of its structure. At this early stage, common questions revolve around translating words and sentences from a shared language into the target language. Experienced fieldworkers know the pitfalls and downsides of translation tasks, but even they acknowledge that it’s a good starting point before diving deeper.

However, depending on where you are, speakers of suitable languages are often not easy to find. Coming regularly to a fieldwork class is time-consuming, and not everyone enjoys giving repetitive translations of boring sentences. That’s why I wanted to find out whether AI can do the job instead. After all, AI is available 24/7 and never complains about stupid tasks. This article gives a short outline of my experiments.

Getting started

Anyone who has ever experimented with AI knows that you can spend endless hours in front of a screen. So I first set a reasonable time limit for the experiment:

How much can I find out in 2 hours of elicitation?

Two hours seemed reasonable. My informants in Papua become pretty tired after 2 hours of intense elicitation (I myself would not last as long if I were in their shoes).

As for the language model, I will be using ChatGPT 3.5, which is the most advanced LLM (large language model) available with my free account in Europe. The crucial point is, of course, which language will be investigated. In a realistic field class, students usually don’t know which language they will encounter. They are expected to start from scratch without any prior assumptions and without being able to look things up. For my AI experiment, I also wanted to be ignorant of which language I would be investigating. I came up with the following prompt:

[A:] I would like to practice linguistic fieldwork. In order to do this, we will simulate an elicitation session. My role is the linguist, your role is the speaker of the object language. Our shared contact language is English. I do not understand your language but I will ask you simple questions that are mostly translation tasks, such as “How do you say: ‘I sleep.’ in your language?” You will answer very precisely and short in your language. I want you to chose a language that is not an Indo-European language as your mother tongue. Do not let me know the name of the language until I ask you. Let’s start our elicitation: How do you say “I sleep”?

And it started right away:

[ChatGPT:] Miwadhi.

Fine, it seemed to work. So I decided to go through the verb paradigm:

[A:] You sleep.
[ChatGPT:] Amanye.
[A:] He sleeps.
[ChatGPT:] Wanaye.
[A:] She sleeps.
[ChatGPT:] Wanaye.
[A:] We sleep.
[ChatGPT:] Tutaye.

As you can see, we quickly settled into a good routine. You can go through paradigms quite rapidly this way. A closer look made me wonder, though. The last four forms seemed pretty regular, whereas Miwadhi, the form we started with, shares no similarity with them. There are two possible explanations: (a) the verb has a suppletive form in the 1st person singular, or (b) it is an inconsistency and ChatGPT used a different verb here. The latter would not come as a big surprise. You would get inconsistencies like these from human informants, too. Just because you (as a linguist) think in paradigms and tables does not mean your interlocutor does. In a real-world situation I would put a question mark on Miwadhi and just continue. There’s always time to come back later.
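
The by-eye comparison of the five forms can be sketched in a few lines of Python. This is only an illustration and was not part of the session: the function name flag_outliers, the cell labels, and the decision to compare the last two letters are assumptions made for the example.

```python
from collections import Counter

def flag_outliers(paradigm, window=2):
    """Return the paradigm cells whose ending differs from the most common ending."""
    # Compare only the last `window` letters of each form (an arbitrary choice).
    endings = {cell: form.lower()[-window:] for cell, form in paradigm.items()}
    most_common_ending, _ = Counter(endings.values()).most_common(1)[0]
    return [cell for cell, ending in endings.items() if ending != most_common_ending]

# The five responses for "sleep" as given in the session.
sleep = {
    "1SG": "Miwadhi",
    "2SG": "Amanye",
    "3SG.M": "Wanaye",
    "3SG.F": "Wanaye",
    "1PL": "Tutaye",
}

print(flag_outliers(sleep))  # ['1SG']
```

A check like this only says which forms deserve a question mark. It cannot decide between suppletion and an inconsistent answer; that still takes follow-up elicitation.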

However, the next hurdle is right around the corner. It seems English is not the best choice of contact language, since it does not distinguish 2nd person singular from 2nd person plural. Here is what I tried:

[A:] Y'all sleep.
[ChatGPT:] Ninyi mnapotea usingizi.
[A:] You (plural) sleep.
[ChatGPT:] Mnapotea usingizi.

Given that the responses for the other person-number values were single words, this is not satisfying. With human interlocutors I could use pointing gestures to indicate what I mean, but here we are restricted to translation tasks. (Although I have to admit that humans are also sometimes confused when they are asked to translate literally without changing perspective. How often have I prompted something like “I sleep” only to find out later that the speaker offered me “You sleep” in his/her language.) I am sure there is a good workaround for this problem, but since I have a time limit, I set this aside and do not bother with 2nd person plural anymore.¹

Some findings

My experiment continues with translation of simple sentences. I elicit declaratives like ‘I eat meat’, ‘You like meat’, ‘He sees meat on the table’, ‘The man sees the woman’, ‘The child eats meat’ and so on.

The big picture of the (still anonymous) object language emerged quickly:

the constituent order is SVO and SV
verbs are morphologically complex, with prefixing as the dominant morphological process
prefixes of transitive verbs follow the order SBJ-TAM-(OBJ)-root
numerals follow the head noun: nguruwe watatu ‘three pigs’, nguruwe wawili ‘two pigs’

After around 1.5 hours of elicitation I could write down some nice tables. Here is a pronoun paradigm that covers motion towards a referent:

kwangu ‘to me’
kwetu ‘to us’
kwako ‘to you’
kwake ‘to him’
kwao ‘to them’
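
A first mechanical step with a paradigm like this is to pull out the material that all forms share. The sketch below does so with Python's standard os.path.commonprefix, which compares strings character by character. The split it prints is purely string-based and is an assumption for illustration, not a claim about where the morpheme boundary actually lies.

```python
import os

# The elicited forms and their glosses, as listed above.
forms = {
    "kwangu": "to me",
    "kwetu": "to us",
    "kwako": "to you",
    "kwake": "to him",
    "kwao": "to them",
}

# Longest string that every form starts with.
shared = os.path.commonprefix(list(forms))

for form, gloss in forms.items():
    print(f"{shared}-{form[len(shared):]}\t{gloss}")
```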

Subject prefixes in the initial position of the inflected verb are:

ni- ‘1SG’
tu- ‘1PL’
u- ‘2SG’
a- ‘3SG’
wa- ‘3PL’

From the TAM markers, we have at least the following:

na- ‘present tense’
li- ‘past tense’
ta- ‘future tense’
…

The construction used to express possession, the translational equivalent of English “I have / you have / he has …”, seems to consist of subject and TAM prefixes without any verb root:

nina ‘I have’
una ‘you have’
wana ‘they have’
…
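
The claim that these forms consist of a subject prefix plus a TAM prefix and nothing else can be checked mechanically against the two prefix lists above. The following is a minimal sketch under stated assumptions: the names match_prefix and segment are made up for this example, the labels PRS/PST/FUT are shorthand for the tense glosses, matching is greedy (longest prefix first), and the optional object slot is left out because no object prefixes are listed here.

```python
# Prefix inventories as elicited so far (incomplete by design).
SUBJECT = {"ni": "1SG", "tu": "1PL", "u": "2SG", "a": "3SG", "wa": "3PL"}
TAM = {"na": "PRS", "li": "PST", "ta": "FUT"}

def match_prefix(word, table):
    """Return (prefix, rest) for the longest prefix from table, or None."""
    for prefix in sorted(table, key=len, reverse=True):
        if word.startswith(prefix):
            return prefix, word[len(prefix):]
    return None

def segment(word):
    """Split a form into (subject prefix, TAM prefix, remainder), or None."""
    subj = match_prefix(word.lower(), SUBJECT)
    if subj is None:
        return None
    tam = match_prefix(subj[1], TAM)
    if tam is None:
        return None
    return subj[0], tam[0], tam[1]

for word in ["nina", "una", "wana"]:
    print(word, segment(word))
# nina ('ni', 'na', '')
# una ('u', 'na', '')
# wana ('wa', 'na', '')
```

The empty remainder in all three cases is what "without any verb root" means in practice. Greedy matching is a simplification: with a fuller prefix inventory, some forms would allow more than one segmentation and would have to be checked by hand.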

Finally, I got some basic vocabulary (I left out the verbs since I do not know a good citation form yet):

Language X	English
nyama	meat
nzuri	good
baridi	cold
laini	soft
mezani	(on the) table
mtu	man
mwanamke	woman
mtoto	child
wanawake	women
watoto	children
wanaume	men
tufaha	apple
nguruwe	pig
samaki	fish
nyumba	house
maji	water
pale	there
hapa	here
mama	mother
baba	father

Some of these items are probably inflected forms, so there is still plenty of work ahead to find out more about these lexemes. Also, plural formation of nouns is not yet clear from so few examples.
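
To see why the plural pattern is still unclear, one can line up the three singular/plural pairs from the table and strip off whatever ending they share. This is a rough sketch: the helper split_pair is a hypothetical name, and pairing the forms by their English glosses (man/men, woman/women, child/children) is an assumption that may well be wrong for some pairs.

```python
import os

def split_pair(singular, plural):
    """Return (singular beginning, plural beginning, shared ending)."""
    # Longest common suffix = longest common prefix of the reversed strings.
    shared = os.path.commonprefix([singular[::-1], plural[::-1]])[::-1]
    cut = len(shared)
    return singular[:len(singular) - cut], plural[:len(plural) - cut], shared

# Pairs matched by English gloss only.
pairs = [("mtoto", "watoto"), ("mwanamke", "wanawake"), ("mtu", "wanaume")]

for sg, pl in pairs:
    print(sg, pl, split_pair(sg, pl))
# mtoto watoto ('m', 'wa', 'toto')
# mwanamke wanawake ('mwanam', 'wanawa', 'ke')
# mtu wanaume ('mtu', 'wanaume', '')
```

Only the first pair gives a clean alternation between two prefixes. The other two do not reduce to it by simple string comparison, so either the pairs are not true singular/plural counterparts or more is going on inside the word. Both possibilities call for more data.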

The resolution

Now that we have reached the end of our experiment, it is time to reveal which language it was.

[A:] Now it is time to tell me which language you were speaking.
[ChatGPT:] I was speaking Swahili, a Bantu language widely spoken in East Africa. It is one of the official languages of several countries, including Kenya, Tanzania, and Uganda, and serves as a lingua franca in the region.

To be honest, I kind of expected it to be Swahili after around 10 minutes. After years of reading literature in linguistics, one recognises the larger languages quite well. Nonetheless, students will not identify the language as easily and the final resolution will be more revealing to them.

At the end, the initial question about the form Miwadhi comes back to mind. So let’s ask right away:

[A:] What does the Swahili word "Miwadhi" mean?
[ChatGPT:] I apologize for any confusion caused. "Miwadhi" is not a word in Swahili or any other Bantu language that I'm aware of. It seems there might have been a misunderstanding in my responses. If you have any more questions or need further assistance, feel free to ask.

Just like with human informants, it’s good to double check all your data.

Conclusion

It’s really impressive how well this method works. The individual questions are mere translation tasks, so one would get similar results practicing with machine translation systems like Google Translate or DeepL. However, the question-answer style of AI chatbots simply feels more like working with real humans than plain machine translation does. Most intriguing is the way we can let ChatGPT choose the language without telling us. It will choose a language that it has been trained on, so one can imagine that the choice will always fall on one of the largest languages. Furthermore, I guess it will choose a language that works well with Latin script (so Mandarin or Japanese should be unlikely) if the prompt that starts the process is written in this script.

I can highly recommend that anyone interested in linguistic fieldwork start experimenting themselves. This type of practice can become very useful, especially in the training of your students.

¹ Besides that, it would be a challenging task to come up with good prompts to check for dual number and other, fancier features that have no simple equivalent in English.