Laguna Wednesday, November 11, 2015

Generating Names

What is a name? What makes it unique? What makes you easily identify it as such? How can we intuitively see from which cultural domain it emerges?

PROCJAM Names

For this week its PROCJAM. There are many resources and tutorials available on how to generate levels (like heightmaps or connected rooms). But when it comes to fill these levels with life and personality, not that much information is available. Nevertheless it is a requirement to generate Names for persons, cities, locations, enemies or anything in your game that should be identified uniquely. This task becomes even more interesting, when you want to have distinct names for different cultural or social domains or races. Of course your elvish cities should be named differently from your orc warriors.

Three Methods

In this post I will present three distinct ways to generate names. I will start with the most rigid way and end with the most flexible way. I implemented the second and third method in a C# program. The source code is available on Bitbucket.

Method 1: Lookup lists

The most simple (and not very procedural) way of creating names is just to have a list of names. When you need a name just randomly pick one from the list. If you want to avoid duplication just pop the name.

For different cultural domains it is feasible to have different lists. On the internet there are various lists of names, that can be used free of charge and will yield a large amount of input.

The big drawback of this approach is that you are limited to the names on the lists. Especially if you have some special requirements.

typo screenshot

For Example, for the game Typo I needed names that are more or less difficult to type and thus had to maintain different lists and could not just copy and paste some reference lists.

One large list (most common US citizen names) that I used for my name generation can be viewed here: Name List (bitbucket)

Method 2: Grammar rules

If you want to go one step further and really get a large amount of unique and procedurally generated names, you can use what is called grammar rules. For this you manually specify some rules on how names should be built out of basic blocks.

The most basic building block of names are single characters. But when you start to roll your name by appending randomly selected chars, you will get strange names like "grtnsa" or "mhowz". So we need a little bit more attention to detail here. What works very well is to divide all characters into groups, like consonants, vowels and special characters (inverted commas, e.g. for orc city names).

Then think about an arrangement of this building blocks. What works nice is to alternatingly use consonants and vowels. This will yield reasonable names that can mostly be pronounced correctly. The following list presents some five letter names created by alternating consonants and vowels.

Negic Tadal Caten Fahot Nafot Baber Lenal Warog Firig

Of course you can also create names of other length. If you want to do even better, think about the following: Not all letters of the alphabet occur with the same frequency. Therefore it is useful to aid the generator with a little help in the right direction. On Wikipedia you can find a frequency table that can easily be used to mimic this behaviour.

This Method works even better if you use some larger building blocks. Like fixed suffixes

il, en, ith, son, ...

that can be concatenated to the names above. Some of them work better with three or four letter names.

What you can also do is some consonant duplication in the middle of the name. Therefore you need additional information, since some consonants can easily be doubled (like "n" or "t") while some can not (like "h" or "x"). Adding duplication the middle consonant in the upper list (with exclusion) and appending some suffix will yield

Neggicil Taddalen Catten Naffotson Babberman Lennalen Warrogil Firrigson

which are also perfectly fine names. Note how on Catten no suffix has been appended, since the last two characters already match one suffix. I encourage you to create your own set of rules. The possibilities are endless though not all rules will generate sensible names.

The big advantage is that this method allows you to generate random names by using some simple rules. The disadvantage is to create distinct cultural domains. A lot can be done with the suffix but again this is some kind of prepared list of words.

Method 3: Monte Carlo names

Monte Carlo methods describe a class of computational algorithms that are based on random decisions. The basic idea for this topic is to start with a name that is composed of completely random characters, which is most likely a bad name. Iteratively throughout the Monte Carlo process the name becomes better and better until it is good enough to hold as a name for a person, city or anything else.

To define bad or good names, we need some criteria, which can be trained by the lists mentioned earlier. This criteria can be anything you like (e.g. number of consonants in a name or how often the same letter appears twice or more). Here I will use two simple concepts: Pairs and Correlations. A pair of length N is just the sequential occurrence of N characters. As an example

Name: helen
Pairs N=2: he el le en
Pairs N=3: hel ele len
Pairs N=4: hele elen

A Correlation just looks N characters forward in the name and reports which chararacter follows in a certain distance

Name: helen
Correlation N=1: h?l e?e l?n
Correlation N=2: h??e e??n

Where a ? denotes any arbitrary single character.

Armed with Pairs and Correlations and a list of names you can now create the rating function. By analysing all names in the list you will get some frequency table for Pairs and Correlations. For every occurrence (like in the "helen" example above) you add 1 to the respective Pair and Correlation count. As the input data have been real names you want to mimic the behaviour of those names and thus have criteria for good names. The rating function just checks the name it is given and looks for any Pairs and Correlations it has been trained with. For each hit it adds the number that is according to the Pair or Correlation that has been found.

Now we have the rating function and can start creating names. This is like rolling consonants and vowels according to the letter frequency table in the grammar method. You start over with a completely random name. Calculate the current rating of the name. Then exchange a random character in the name with another random character of the letter frequency table. Then calculate the new rating. If this value is better than the old rating, accept the change, otherwise refuse it and undo it. This should be done in a loop (you should make sure that each character has been touched at least some couple of times). After about 100 iterations you will have a very nice name. As Monte Carlo algorithms are iterative, the longer it runs the better the results will get.

Some names that have been created with this method are

Lana Eran Anaara Alan

An interesting fact is that you can feed the rating function lists with different cultures and will get names that follow that cultural rules. (notice how the training with Chinese city names also added ' to the name list).

Orc names: Ogrugu Kug Duguk Trugga
Chinese cities: Qhang Yangua Angsho Ing'ang

The drawbacks of this method are, that you need a list of names that you can feed in the rating function. Also it can be quite expensive to do all those iterations. The number of possible pairs with N=2 are 24*24 (number of characters for the first and the second letter multiplied) are 576. For Pairs with N=3 it is already 13824 and for Pairs with N=4 it is something over three million. Let's assume that only 5 percent of the combinations will occur. Then there is a massive amount of substring checking to be done. If you do many iterations (remember, that is what you need to do to get good names) this can take some time.

Nevertheless the Monte Carlo Name Generator is my favorite as it can produce huge varieties of very nice names that follow a certain cultural trend and is truly procedural.

Cleaning up

As you probably noticed, I did not investigate the questions proposed at the beginning. So far I am still a physicist and not a linguist. But at least I try to answer those questions in the way I have learned (namely statistical methods, numerics, data analysis and correlation functions). The proposed simple criteria for the rating function work surprisingly well - at least for me. It is also stunning that by using a really simple alternating grammar it is feasible to create such names.

I hope you enjoyed reading this tutorial on how to generate names procedurally.