Linux terminal for linguists: The grep command.

Introduction

It’s amazing how much work can be done in a simple terminal. With a few commands, the average Linux user can accomplish tasks for which a typical Windows user has to click through confusing GUIs. One such powerful terminal command is ‘grep’ (global regular expression print). In this tutorial, I will show how to use ‘grep’ for some simple morphology tasks.

If you want to practice hands-on, I recommend downloading this English wordlist from GitHub. There you can find a plain text file “words.txt”, which is basically a huge list of English vocabulary. Each word is on a separate line.

You can print the entire wordlist in your terminal (standard output) with the following command (I’ll abbreviate outputs that are too long with […] here):

foo@bar:~$ cat words.txt
[...]

The file is pretty large, so there is no point in looking for words by scrolling up and down your terminal. What we need is a filter, and this is exactly where the ‘grep’ command comes in handy.

Pipes

Before we can utilize the full potential of grep, we need to learn a little about pipes in Linux. A pipe is a mechanism to join two terminal commands (i.e. applications) so that the output of the first becomes the input of the second. A simple example is the ‘wc’ command, which counts the lines, words and bytes of its input. If we pipe the output of ‘cat’ (with our file as argument) into ‘wc’, the lines of the file are no longer printed in the terminal but serve as input for ‘wc’. In the following example, the option ‘-l’ specifies that ‘wc’ counts only the lines of the input:

foo@bar:~$ cat words.txt | wc -l
466550

As we see, the file contains 466,550 lines. Since each line has exactly one word, this is quite a large inventory of words.
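To make the pipe mechanism concrete, here is a minimal sketch on a tiny stand-in wordlist. The file ‘sample.txt’ and its five words are made up for illustration and are not part of the real “words.txt”. It also shows that ‘wc’ can read a file through input redirection (‘<’), which gives the same count without ‘cat’.

```shell
# Build a tiny stand-in wordlist (hypothetical sample, not the real words.txt).
workdir=$(mktemp -d)
printf '%s\n' apple applesauce dapple grapple pear > "$workdir/sample.txt"

# Pipe version: cat writes the lines, wc -l counts them.
piped_count=$(cat "$workdir/sample.txt" | wc -l)

# Redirect version: wc reads the file on standard input, no cat needed.
redirected_count=$(wc -l < "$workdir/sample.txt")

echo "piped: $piped_count, redirected: $redirected_count"

# Clean up the temporary directory.
rm -r "$workdir"
```

Both variables should hold the same number, since both commands count the lines of the same five-line file.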

grep basics

Let’s say I want to check if the wordlist contains the word applesauce. I can pipe the output of ‘cat’ into ‘grep’:

foo@bar:~$ cat words.txt | grep applesauce
applesauce

There it is: a plain and simple applesauce is filtered from the wordlist and printed in the terminal. As you can see, the argument that ‘grep’ takes is the string for which the input is filtered. One can also use the filtering to exclude matches. This is done with the option ‘-v’. The following command returns all words from the list that do not include the letter ‘e’:

foo@bar:~$ cat words.txt | grep -v e
[...]
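The effect of ‘-v’ is easiest to see on a handful of words. The following sketch uses a made-up four-word file (‘sample.txt’ is an assumption for illustration) and runs the same search once normally and once inverted.

```shell
# Hypothetical four-word sample, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' apple banana kiwi plum > "$workdir/sample.txt"

# Lines that contain the letter e.
with_e=$(grep e "$workdir/sample.txt")

# Lines that do NOT contain the letter e (-v inverts the match).
without_e=$(grep -v e "$workdir/sample.txt")

echo "with e:"
echo "$with_e"
echo "without e:"
echo "$without_e"

# Clean up the temporary directory.
rm -r "$workdir"
```

Every line of the input ends up in exactly one of the two results, so ‘-v’ is a convenient way to get the complement of a search.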

Actually, the grep command does not need pipes at all. We can get the same result by specifying the filename as the second argument:

foo@bar:~$ grep applesauce words.txt
applesauce

However, using pipes has some crucial benefits, including that the command is more readable when we separate the file input (‘cat’) from the filtering (‘grep’).

Now, applesauce was a single match. But ‘grep’ actually returns all strings that contain the specified search string. There are many entries that contain apple:

foo@bar:~$ cat words.txt | grep apple
appleberry
appleblossom
applecart
[...]

The resulting list also contains words like dapple or grapple. If we want to make sure that only words beginning with apple are found, we can put the ‘^’ symbol at the beginning of the search string. Likewise, a ‘$’ symbol at the end returns only words that end with apple:

foo@bar:~$ cat words.txt | grep 'apple$'
bakeapple
baked-apple
balm-apple
[...]

The reason these symbols work is that the search string is actually a regular expression. I highly recommend digging deeper into that topic to explore the full potential of ‘grep’.
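Here is a small sketch of both anchors side by side, again on a made-up sample file (‘sample.txt’ and its contents are assumptions for illustration). The patterns are put in single quotes so that the shell passes ‘^’ and ‘$’ to ‘grep’ unchanged.

```shell
# Hypothetical sample wordlist, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' apple applecart bakeapple dapple grapple pineapples > "$workdir/sample.txt"

# ^ anchors the pattern at the beginning of the line.
starts_with=$(grep '^apple' "$workdir/sample.txt")

# $ anchors the pattern at the end of the line.
ends_with=$(grep 'apple$' "$workdir/sample.txt")

# Both anchors together match only lines that consist of exactly "apple".
exact=$(grep '^apple$' "$workdir/sample.txt")

echo "starts with apple:"; echo "$starts_with"
echo "ends with apple:";   echo "$ends_with"
echo "exactly apple:";     echo "$exact"

# Clean up the temporary directory.
rm -r "$workdir"
```

Note that pineapples contains apple but matches none of the three anchored patterns, because apple is neither at its beginning nor at its end.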

As a next step, we can chain pipes to connect more than two programs. Let’s say we want to find out how many words in the file contain the substring apple. We simply pipe the output of ‘cat’ through ‘grep’ into ‘wc’:

foo@bar:~$ cat words.txt | grep apple | wc -l
101
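As a side note, ‘grep’ can also count matching lines itself with the option ‘-c’, which saves the extra pipe to ‘wc’. The following sketch compares both variants on a made-up sample file (‘sample.txt’ is an assumption for illustration).

```shell
# Hypothetical sample wordlist, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' apple applecart dapple pear plum > "$workdir/sample.txt"

# Variant 1: let grep print the matches and count the lines with wc.
piped=$(grep apple "$workdir/sample.txt" | wc -l)

# Variant 2: let grep count the matching lines itself (-c).
counted=$(grep -c apple "$workdir/sample.txt")

echo "wc -l: $piped, grep -c: $counted"

# Clean up the temporary directory.
rm -r "$workdir"
```

Both variants count matching lines, not individual occurrences, which is the same thing here because every line holds exactly one word.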

Of course, this method is too coarse for morphological analyses since ‘grep’ also returns strings in which the string …apple does not represent the morpheme apple, as in dapple. So I recommend always reviewing the results before going a step further. This will be the topic of the next section.

Writing the output to files

Inspecting the filtered results in your terminal output is probably not enough. As a linguist you want to go ahead and use the results for your ongoing studies. In order to do that, you can save them to a file. You can direct the output of a command straight into a file by using the ‘>’ operator:

foo@bar:~$ cat words.txt | grep apple > apples.txt

Your current working directory should now contain a file ‘apples.txt’ that contains all the output that was filtered by ‘grep’. Pretty neat! Now you can go on, revise the results (e.g. throw out all non-apples like dapple) or use the results as input for other operations (e.g. scanning a text-corpus for words from this list).
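One thing to keep in mind is that ‘>’ overwrites an existing file without asking. If you want to collect the results of several searches in one file, the related operator ‘>>’ appends instead. The sketch below illustrates the difference; the file names ‘sample.txt’ and ‘hits.txt’ are made up for this example.

```shell
# Hypothetical sample wordlist, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' apple dapple pear pearl > "$workdir/sample.txt"

# '>' creates the file or overwrites it if it already exists.
grep apple "$workdir/sample.txt" > "$workdir/hits.txt"
grep pear  "$workdir/sample.txt" > "$workdir/hits.txt"
overwritten=$(cat "$workdir/hits.txt")   # only the pear matches survive

# '>>' appends to the end of the file instead.
grep apple "$workdir/sample.txt" >  "$workdir/hits.txt"
grep pear  "$workdir/sample.txt" >> "$workdir/hits.txt"
appended=$(cat "$workdir/hits.txt")      # apple matches followed by pear matches

echo "after overwriting:"; echo "$overwritten"
echo "after appending:";   echo "$appended"

# Clean up the temporary directory.
rm -r "$workdir"
```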

A final example

Let’s say we are interested in the English suffix -scape. First, we filter out all words from the list that end in this sequence and write them to a separate file:

foo@bar:~$ cat words.txt | grep 'scape$' > scape_candidates.txt

Here is the full content of the new file:

airscape
ascape
bathyscape
cityscape
cloudscape
dreamscape
escape
farmscape
inscape
landscape
manscape
moonscape
nonescape
offscape
outscape
preescape
pre-escape
re-escape
relandscape
riverscape
scape
seascape
sea-scape
self-escape
skyscape
snowscape
soundscape
streetscape
townscape
treescape
waterscape

Some of these words are derivations with the suffix -scape while others have this substring by chance. Of course, we can now investigate the list manually and eliminate all items that aren’t derivations with -scape. In other cases, however, the resulting list might be pretty long, so we want to continue processing our data in the terminal.

To rule out non-derivations, we can check whether the substring before -scape is a real word. E.g. we can rule out escape since e is not an English stem that is a base for derivation. Let’s see how we can perform this test automatically.

In a first step, we strip the -scape suffix from all our words and see what’s left. We can do this by making use of the ‘sed’ command. The output is written to a new file:

foo@bar:~$ cat scape_candidates.txt | sed 's/scape//' > scape_candidates_without_suffixes.txt

The argument ‘s/scape//’ tells ‘sed’ to substitute: whenever the search string between the first and second slash is found, it is replaced with whatever stands between the second and third slash. In this case, the first occurrence of ‘scape’ in each line is replaced with nothing. As a result we get the following list:
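Since the search string of ‘sed’ is a regular expression too, it can be anchored with ‘$’ so that only a ‘scape’ at the very end of a line is removed. For the candidate list above this makes no difference, but the following sketch shows where the two variants diverge. The string scapegoatscape is invented purely to demonstrate the effect, and ‘sample.txt’ is a made-up file.

```shell
# Hypothetical sample; "scapegoatscape" is an invented string for demonstration.
workdir=$(mktemp -d)
printf '%s\n' landscape escape scapegoatscape > "$workdir/sample.txt"

# Unanchored: removes the FIRST occurrence of "scape" in each line.
unanchored=$(sed 's/scape//' "$workdir/sample.txt")

# Anchored: removes "scape" only if it stands at the end of the line.
anchored=$(sed 's/scape$//' "$workdir/sample.txt")

echo "unanchored:"; echo "$unanchored"
echo "anchored:";   echo "$anchored"

# Clean up the temporary directory.
rm -r "$workdir"
```

For the invented third word, the unanchored version strips the wrong ‘scape’ and leaves goatscape, while the anchored version leaves scapegoat.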

air
a
bathy
city
cloud
dream
e
farm
in
land
man
moon
none
off
out
pree
pre-e
re-e
reland
river

sea
sea-
self-e
sky
snow
sound
street
town
tree
water

Such a list can be used as input file for the ‘grep’ command. However, we first need to mark the word boundaries. We want to determine which of these strings are words on their own and not substrings of other words, so each line needs to be wrapped in ^…$ (see regular expressions above). Again, we can achieve this with the ‘sed’ command:

foo@bar:~$ cat scape_candidates_without_suffixes.txt | sed 's/^/^/;s/$/$/' > scape_candidates_patterns.txt

The argument of ‘sed’ is a bit confusing, but it basically means that the beginning of each line (matched by ‘^’) is replaced with a literal ‘^’ character, and analogously the end of each line (matched by ‘$’) with a literal ‘$’ character. What we get is this:

^air$
^a$
^bathy$
^city$
^cloud$
^dream$
^e$
^farm$
^in$
^land$
^man$
^moon$
^none$
^off$
^out$
^pree$
^pre-e$
^re-e$
^reland$
^river$
^$
^sea$
^sea-$
^self-e$
^sky$
^snow$
^sound$
^street$
^town$
^tree$
^water$
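If the two-part ‘sed’ argument looks confusing, the same wrapping can presumably be written as a single substitution: ‘.*’ matches the whole line and ‘&’ in the replacement stands for whatever was matched. The sketch below checks on a made-up three-line file (‘stems.txt’, including one empty line like the one left over from scape) that both spellings give the same result.

```shell
# Hypothetical stem list with one empty line, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' air '' sea- > "$workdir/stems.txt"

# Two substitutions: put ^ at the line start, then $ at the line end.
two_step=$(sed 's/^/^/;s/$/$/' "$workdir/stems.txt")

# One substitution: match the whole line (.*) and re-insert it (&) between ^ and $.
one_step=$(sed 's/.*/^&$/' "$workdir/stems.txt")

echo "two substitutions:"; echo "$two_step"
echo "one substitution:";  echo "$one_step"

# Clean up the temporary directory.
rm -r "$workdir"
```

In both cases the empty line turns into the pattern ‘^$’, which only matches empty lines in the wordlist.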

Now we are ready to filter our wordlist for these strings. The ‘grep’ command can read its patterns from a file via the ‘-f’ option:

foo@bar:~$ cat words.txt | grep -f scape_candidates_patterns.txt
a
farm
in
none
off
out
pree
reland
sea
water

So, these are candidates for stems that are subject to derivation with -scape. Let me add two comments on that.

First, we see that there are false positives, i.e. candidates that are clearly not part of a derivation. A case in point is a in the list. It stems from ascape (don’t ask me, I am not a native speaker) in the original wordlist. Naturally, the final ‘grep’ command finds the indefinite article a and therefore includes it in this list. One needs to delete such false positives manually in a final revision.

Second, and much worse, there are also false negatives, which means that items were dropped incorrectly. Most obviously, we do not have land from landscape, but the stems of riverscape and soundscape are missing as well. This has to do with the awkward fact that the simple words land, river and sound are not in ‘words.txt’. I didn’t make this list myself and I have no clue why such important words are missing, but it’s a perfect illustration of how careful we need to be with all fancy methodology: your analysis is only as good as the data it is based on. If your data is crap, you are not going to make it far. Even if you use the Linux terminal.
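As a closing aside, the whole workflow of the final example can be sketched in a more compact form. This is an alternative under stated assumptions, not the method used above: instead of wrapping each stem in ^…$, it relies on the ‘grep’ options ‘-x’ (match whole lines only) and ‘-F’ (treat patterns as fixed strings rather than regular expressions), and it drops empty stems beforehand. The files ‘sample.txt’ and ‘stems.txt’ and the seven sample words are made up for illustration.

```shell
# Hypothetical miniature wordlist, only for illustration.
workdir=$(mktemp -d)
printf '%s\n' a escape farm farmscape landscape sea seascape > "$workdir/sample.txt"

# 1. Keep words ending in "scape", 2. strip the suffix, 3. drop empty stems.
grep 'scape$' "$workdir/sample.txt" \
  | sed 's/scape$//' \
  | grep -v '^$' > "$workdir/stems.txt"

# 4. Keep only stems that occur as a whole line (-x) in the wordlist;
#    -F treats every stem as a fixed string, -f reads the stems from the file.
attested=$(grep -F -x -f "$workdir/stems.txt" "$workdir/sample.txt")

echo "$attested"

# Clean up the temporary directory.
rm -r "$workdir"
```

In this miniature list, e and land are dropped because they are not listed as words of their own, which mirrors the false negatives discussed above.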