Coders have an interesting coping mechanism to deal with these sorts of things: program a bot to automate the addiction. This is a bit better than grinding for points; it’s a nice programming exercise that is slightly subversive and without intellectual property issues that could worry one’s employer.

During the design phase of the bot, I’ve become stuck on a little mathematical question that has led me on a journey from coding to mathematics via some impressively knowledge-dense internet resources:

How many different Bejeweled boards are there?

Match-three tile games like Bejeweled are played on a square grid with colored tiles. The game begins with a randomly generated board that satisfies the following conditions:

- No horizontal or vertical three-in-a-row runs of the same color
- At least one pair of neighboring colors can be swapped to create a horizontal or vertical three-in-a-row run

The figure below shows a legal 4-by-4 board on the left and an illegal starting board on the right.

Bejeweled is played on an 8-by-8 board using tiles of 7 colors at the start, so an upper bound on the number of initial legal boards is 7^64 ≈ 1.2 × 10^54. This is about the number of atoms in ten thousand Earth-sized planets!

The colors of the tiles on the board do not actually define a unique board position, according to the rules of the game. Consider the figure below; the second board is a copy of the first board with its tile colors permuted.

The first two boards are identical with respect to which actions can be taken.

To avoid counting boards with permuted colors more than once, one can define a mapping that normalizes the colors on any proposed board, and then refuse to count the same color-normalized board more than once. The normalization scheme used in this post assigns each color a number according to the order in which it first occurs on the board, from left-to-right and top-to-bottom. This is what the third board in the figure represents – notice how it captures the game-relevant information in both of the first two boards.
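As an illustration of this normalization scheme (my own sketch, not the post's code), the renumbering can be written as:

```python
def normalize(tiles):
    """Renumber colors 1, 2, 3, ... in order of first appearance,
    scanning the board flattened left-to-right, top-to-bottom."""
    mapping = {}
    out = []
    for color in tiles:
        if color not in mapping:
            mapping[color] = len(mapping) + 1
        out.append(mapping[color])
    return out
```

Any two boards that differ only by a color permutation normalize to the same sequence, e.g. `normalize(['red','blue','red','green'])` and `normalize(['blue','green','blue','red'])` both give `[1, 2, 1, 3]`.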

This representation is *somewhat* smaller than the original upper bound of 7^64. As a trivial proof of this, note that there is now only one possible color value for the upper-left tile instead of seven.

Perhaps a computer program could count the number of unique normalized boards? This JSFiddle is a JavaScript program that enumerates all normalized tile sequences of a certain length. Notice how quickly the numbers grow from a board with 1 tile to a board with 13 tiles: 1, 2, 5, 15, 52, 203, 877, 4139, 21110, 115179, 665479, 4030523, 25343488. There is no way that this approach will scale to 64 tiles.
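The same counts can be reproduced with a short recursive sketch in Python (my own, not the JSFiddle code): each tile either reuses a color already introduced or introduces the next new color, capped at 7.

```python
def count_normalized(n_tiles, max_colors=7):
    """Count normalized color sequences of length n_tiles."""
    def rec(remaining, used):
        if remaining == 0:
            return 1
        total = used * rec(remaining - 1, used)   # reuse an existing color
        if used < max_colors:                     # or introduce a new one
            total += rec(remaining - 1, used + 1)
        return total
    return rec(n_tiles, 0)

print([count_normalized(n) for n in range(1, 9)])
# [1, 2, 5, 15, 52, 203, 877, 4139]
```

The exponential branching is exactly why direct enumeration cannot reach 64 tiles.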

Faced with this computational hurdle, it is expedient to employ the primary tool of the curious modern slacker: internet search. A query to the Online Encyclopedia of Integer Sequences (OEIS) finds the sequence A099262. Turns out our sequence-of-interest shares the same starting entries as the partial sum of Stirling Numbers of the Second Kind, S(n, 1) + S(n, 2) + … + S(n, k):

Where n = 64 and k = 7 for the standard Bejeweled Blitz board. This is an easy sum to evaluate with the right tools, but it remains to be shown that the two sequences are actually identical.

Stirling Numbers of the Second Kind, written as S(n, k), are the number of ways to partition a set of n labeled objects into k non-empty, unlabeled subsets. Below, we relate the things that this sequence is counting with the unique, color-normalized board instances.

Let each of the 64 tiles on the board be one of the n labeled objects. Let the total number of different colors on the color-normalized board be the value of k. For the moment, let’s count only the number of color-normalized boards containing exactly k different colors. The quantity S(64, k) is the number of ways one can assign each tile to one of the k color-related subsets.

The trick now is to show that there is a way to label the unlabeled subsets counted by S(64, k) such that the labels respect our normalization scheme. First, take the subset containing the first tile on the board and assign it the label “color #1”. Then, find the next tile on the board that is not in this set and assign its subset the label “color #2”. Continue until all subsets have been assigned color labels. Since S(64, k) counts only non-empty subsets, each of the k colors will be labeled with a color number.

This should be a good start to a convincing explanation, although there are missing details that a real proof requires. The remaining work involves adding up the counts for all initial boards with only one color, only two colors, and so on up to k = 7 colors. This is where the outer-most sum comes into play.

A WolframAlpha query `Sum[StirlingS2[64,i],{i,1,7}] in scientific notation` calculates the exact number of color-normalized boards. This rounds to about 2.4 × 10^50, which is still an enormous number that is slightly greater than the estimated number of atoms in Earth. But it is *four orders of magnitude smaller* than our original upper-bound!
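The same sum can be checked exactly in Python using the textbook recurrence S(n, k) = k·S(n−1, k) + S(n−1, k−1) (a sketch with exact integers, no WolframAlpha required):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind, via the standard recurrence."""
    if n == k:
        return 1
    if n == 0 or k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

boards = sum(stirling2(64, i) for i in range(1, 8))
print(boards)           # exact count of color-normalized boards
print(boards < 7**64)   # True
```

At n = 8 the partial sum gives 4139, matching the brute-force enumeration above.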

I hope to find the time to continue to chip away at this upper-bound by finding ways to throw away the color-normalized boards that don’t respect the match-three constraints. For the moment, I’m content to marvel at the crazy path from programming, through internet search, through actual mathematics, to winding up at a solution that can be evaluated in seconds in an accessible mathematical tool.

Match-three tile games like Bejeweled are similar to a class of two-player games like Tic-Tac-Toe, called m,n,k-games. It may be the case that techniques used to analyze the state space of these games are applicable to one-player tile matching games. One possible starting point is the summary paper “Games solved: Now and in the future” by van den Herik, Uiterwijk, and van Rijswijck (PDF). A detailed history of match-three games is also available, although it doesn’t address the mathematical analysis side of things: “Swap Adjacent Gems to Make Sets of Three: A History of Matching Tile Games” by Jesper Juul (2007).

The problem: compute log(a + b) when given only log(a) and log(b). The straightforward algorithm first uses the exponential function to convert the log values to the linear-domain representations a and b, then performs the sum, and finally uses the log function to convert back to the log domain:
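A direct transcription of the straightforward algorithm (my sketch; the post's original snippet did not survive the formatting):

```python
import math

def log_add_naive(log_a, log_b):
    """log(a + b) from log(a) and log(b), via a round trip through the linear domain."""
    a = math.exp(log_a)   # overflows for log values above ~709 (double precision)
    b = math.exp(log_b)
    return math.log(a + b)
```

For moderate inputs this is exact, but `math.exp` raises `OverflowError` for large log values, which motivates the table trick below.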

These conversions between log and linear domains are slow, and problems arise when a or b is too large or too small for a machine’s floating-point representation.

Luckily, there is a clever approximation method that allows quick, accurate calculation with a relatively small precomputed table.

The table-based approximation can be derived by deciding to first look for a function f such that: log(a + b) = log(a) + f(log(a) - log(b))

Let’s consider the case where log(a) >= log(b), without loss of generality. It looks like this:

Via algebra and some logarithm identities, solve for f in terms of log(a) and log(b): f(x) = log(1 + exp(-x)), where x = log(a) - log(b)

The function f can be pre-computed and stored in a table T. A table is particularly apt because it is:

- Easily indexed as a function of log(a) - log(b), since this is a non-negative number
- Reusable across the entire range of log(a) and log(b), since f depends only on the relative difference between log(a) and log(b)
- Compact, since the contribution is small when log(a) - log(b) is large

The table is pre-calculated as: T[i] = log(1 + exp(-i / F))

and used at run-time as: log(a + b) ≈ log(a) + T[d(log(a) - log(b))]

Where d(x) = floor(F·x + r) is a simple and quick-to-compute discretization function that rounds to table indexes. Let’s call the parameters sampling frequency (F), rounding threshold (r), and table size (N). These affect the approximation accuracy and are described in the next section. For now, the figure below shows the original function f and the table approximation for particular values of F and r:
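Putting the pieces together in Python (a sketch; the parameter values F = 1000, r = 0.5, and the table length are my assumptions, not the post's):

```python
import math

F = 1000          # sampling frequency (assumed)
R = 0.5           # rounding threshold: 0.5 gives standard rounding
N = 89 * F        # table length: entries past ~88*F are negligibly small (assumed)
TABLE = [math.log(1.0 + math.exp(-i / F)) for i in range(N)]

def log_add(log_a, log_b):
    """Approximate log(a + b) using only the precomputed table."""
    if log_b > log_a:                    # ensure log_a >= log_b, so x >= 0
        log_a, log_b = log_b, log_a
    i = int(F * (log_a - log_b) + R)     # discretize: floor(F*x + r)
    if i >= N:                           # b is negligible next to a
        return log_a
    return log_a + TABLE[i]
```

Unlike the naive version, this never calls `exp` on the (possibly huge) log values at run-time, so `log_add(1000.0, 1000.0)` is perfectly well behaved.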

Approximation error is affected by the table length and the discretization function. These are parameterized by the quantities:

- N – table length
- r – rounding threshold
- F – “sampling frequency”

Since this blog post is stunningly long already, F won’t be analyzed in detail. Predictably, increasing the sampling frequency decreases the approximation error at the cost of storing more samples in the table. Since the largest error occurs at the start of the table, increasing F is the best way to lower the maximum error without changing the discretization function.

Longer tables are theoretically better than shorter tables. However, long tables can hurt cache performance and, after a certain point, the table contribution is so small that ignoring it barely affects error. Limiting the size of the table is a smart speed/precision trade-off.

It is possible to provide an upper-bound on the largest *useful* table when the answer to log(a + b) is stored in a limited-precision representation. Consider that the smallest normal value that can be represented in single-precision floating point is about 1.2 × 10^-38. Be generous and allow all intermediate calculations to use arbitrary precision, and allow rounding the result before storing it in single-precision as the final answer. If the table element used contributes less than half of the smallest single-precision-representable value, this table element can never affect the stored result: T[i] = log(1 + exp(-i / F)) < (1/2) × 1.2 × 10^-38

And solving for i gives: i > -F · log((1/2) × 1.2 × 10^-38) ≈ 88 · F

Which is the surprisingly small N ≈ 89 when F = 1.
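This bound is easy to check numerically (my sketch; 1.17549435 × 10^-38 is the smallest normal single-precision value):

```python
import math

SFMIN = 1.17549435e-38  # smallest normal single-precision float

def max_useful_index(F):
    """First index i where T[i] < SFMIN / 2, i.e. where the table entry
    can no longer affect a result rounded to single precision."""
    i = 0
    # log1p keeps full accuracy where log(1 + tiny) would round to zero
    while math.log1p(math.exp(-i / F)) >= SFMIN / 2:
        i += 1
    return i
```

The cutoff scales linearly with F, as the algebra above suggests.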

The table index is proportional to log(a) - log(b). The parameter r determines the discretization of this quantity, so that it can be mapped to an integer and used as a table index. When r = 0, the discretization operation is truncation. When r = 0.5, the discretization operation is standard rounding.

If fast evaluation is the one and only goal, use truncation by setting r = 0. Otherwise, consider the error due to the table approximation: e(x) = f(x) - T[floor(F·x + r)]

Which, for particular values of F and r, looks like:

And looking at the error at the start of the table at particular values of r gives:

One choice of r would be the value that minimizes the total table error (the integral of the error over the table’s range):

This integral is difficult to evaluate and minimize, in part due to the floor function and the discontinuities it introduces. The frequency F and table length N both affect the optimal value. The figures below illustrate the behavior of this equation. In an appendix at the end of this post, near-optimal values of r are provided for different settings of F.

The estimated optimal value of r as F varies:

Changing the logarithm base has the same effect on error as changing the sampling frequency F. If the value of the logarithm base is not important in this part of the application, save a multiplication by setting F = 1 and choosing the log base that achieves the same approximation error.

An attempt at an intuitive, practical explanation is that increasing F maps the set of log(a) - log(b) values to a larger set of integers. Changing the log base has the same effect, since: F · ln(x) = ln(x) / ln(B) whenever ln(B) = 1/F

That is equivalent to log_B(x). Solving for the log base gives B = e^(1/F).
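A quick numerical check of this base/frequency equivalence (a sketch):

```python
import math

F = 1000.0
B = math.exp(1.0 / F)       # the log base equivalent to sampling frequency F

x = 42.0
scaled = F * math.log(x)    # multiply natural logs by F...
rebased = math.log(x, B)    # ...or just take logs in base e^(1/F)
print(scaled, rebased)      # the two agree
```

In other words, the multiplication by F can be folded into the base of the logarithms used throughout the application.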

I originally found this table-based approximation method in the Sphinx 4 source code. It is useful in several places, one of which is in calculating the likelihood of a frame of audio data given a Gaussian mixture model (GMM). This calculation lies in what is effectively the inner-most loop of the speech recognizer. It runs tens of thousands of times each second and speeding it up is a worthwhile optimization.

The source code of LogMath.java and its javadoc documents this beautiful little algorithm well. It would have never caught my attention without that careful documentation. From what I remember, the documentation describes a derivation, some types of log-base tricks, and a similar idea behind how the table length can be bounded. This post builds on those explanations, with a slightly better-framed derivation, an analysis of r that is completely new, and pretty pictures.

This post is also much, much longer. I kept finding new and interesting tidbits as this became my hobby for the month. I couldn’t bring myself to leave some of the gems unindexed. And there are still interesting questions left to be answered. Some of these depend on the details of different applications but there are also general questions like: is it ever worth making a 2-dimensional table to approximate the calculation log(a + b + c)?

Good values of r as F varies, on the basis of minimizing error over the first 100 entries in the table:

F | r | lower-bound on error
---|---|---
1 | 0.588644 | 0.169006
2 | 0.54489 | 0.0861034
3 | 0.529998 | 0.057602
4 | 0.522519 | 0.043254
5 | 0.518022 | 0.0346227
6 | 0.515018 | 0.0288611
7 | 0.512875 | 0.0247426
8 | 0.511267 | 0.0216523
9 | 0.510017 | 0.0192478
10 | 0.509073* | 0.0173243*
15 | 0.506008 | –
20 | 0.504494 | –
25 | 0.50357 | –
30 | 0.502952 | –
35 | 0.5025 | –
40 | 0.502159 | –
45 | 0.501895 | –
50 | 0.501684 | –
100 | 0.5009* | 0.00173275*
200 | 0.500353 | –
300 | 0.500245 | –
400 | 0.50018 | –
500 | 0.500143 | –
600 | 0.500115 | –
700 | 0.500081 | –
800 | 0.500091 | –
900 | 0.500019 | –
1000 | 0.500062* | 0.0000950386*

The numbers with asterisks were calculated over the first 1000 table entries. This gives a better estimate of the best value of for a table of this length or longer. With more entries, the optimal value of inches up, but not greatly. However, the error lower-bound is significantly affected when more samples are used, so only a few of the sufficiently-sampled lower-bounds were shown.

But eventually my old Ubuntu distro wore out, and I decided to create my own Glacier “client” before upgrading. Its mission: upload one file to Glacier. No GUI, no command-line interface, not even a progress bar. Lo and behold, Amazon has a tutorial for exactly this. If you are familiar with the Java/Eclipse (or .Net) ecosystem, then you too can roll your own archival tool with very little time investment and no third-party vendor lock-in.

From memory, the steps were:

- Sign up for Amazon Glacier. It takes about 10 minutes to get access after signing up.
- Add a new “vault” in the AWS Glacier management console
- Download Eclipse IDE for Java EE Developers
- Unpack it, run it, install the AWS plugins by searching for “AWS” in the Eclipse marketplace
- Create an “AWS” project
- Copy your access key and secret key from your AWS account security web page into the “AwsCredentials.properties” file. Make sure you don’t source control this file.
- Copy the file upload tutorial code into a class
- Change the vault, the file path, and perhaps the description to whatever you want to upload
- “Run As” a Java application
- Watch the Eclipse console, see no exceptions, see the expected final output. Red messages may just indicate a network problem that the AWS libs have successfully overcome.

That’s it! There is a downloading example as well.

There are a few ways this casual archival process can fail to assuage moderate levels of paranoia.

First, whatever gets uploaded is potentially readable by rogue Amazon employees, crackers, and the government. Encrypt where you feel privacy is necessary. For me, this is in surprisingly few places. Perhaps consider light encryption on media archives to foil false positives from future copyright infringement detection attempts.

Second, beware that there is one very serious failure mode accessible to an attacker. It goes:

- Gain access to PC
- Gain access to AWS
- Delete Glacier vaults
- Delete local files (and other things like dropbox, gmail, etc.)

At this point, one is left with air-gap backups which hopefully spin up when asked. Relying on Glacier for a man-made disaster scenario means keeping AWS access secure. No local saved passwords. No keys that can be used to delete vaults or archives left scattered in Eclipse workspaces. Amazon has an AWS Identity and Access Management (IAM) service, which may be a way to create a “user” that cannot delete from Glacier?

Piecing together tutorial code from httplib2 and Parse.com will give you the following:

import httplib2
header = {'X-Parse-Application-Id':'myAppIdHere', 'X-Parse-REST-API-Key':'myRestKeyHere'}
url = 'https://api.parse.com/1/classes/myClassHere'
h = httplib2.Http()
resp, content = h.request(url, headers = header)

Executing this will produce the error:

httplib2.SSLHandshakeError: [Errno 1] _ssl.c:499: error:14090086:SSL routines: SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

The workaround is to bypass certificate validation when creating the Http object. It works, but compromises the security of our data.

h = httplib2.Http(disable_ssl_certificate_validation=True)

To take advantage of SSL, we need to get httplib2 using the correct root certificate. The first step in this process is finding the right root certificate to use. This is how I did it:

- In Firefox, go to https://parse.com
- Right-click and drill-down “View Page Info/Security/View Certificate/Details”
- Notice “DigiCert High Assurance EV Root CA”
- Google this to find the DigiCert certificates page
- Right-click and “save-as”: `DigiCertHighAssuranceEVRootCA.crt`
- Use `openssl` to convert this to a PEM file as follows:

openssl x509 -in DigiCertHighAssuranceEVRootCA.crt -out DigiCertHighAssuranceEVRootCA.pem -outform PEM

This is the root certificate that can validate parse.com. Httplib2 can access it if you copy its contents into the certificate store for httplib2 or if you provide its filesystem path when constructing the httplib2.Http object. Here’s the improved and final version of the above code:

import httplib2
header = {'X-Parse-Application-Id':'myAppIdHere', 'X-Parse-REST-API-Key':'myRestKeyHere'}
url = 'https://api.parse.com/1/classes/myClassHere'
h = httplib2.Http(ca_certs='/path/to/DigiCertHighAssuranceEVRootCA.pem')
resp, content = h.request(url, headers = header)

The first step was to survey the current crop of HTML5/CSS/Mobile frameworks. Leading the pack is html5-boilerplate. This is a full-site solution, including things like a `.htaccess` with smart defaults and a build script that automatically optimizes html/css/js/images for the web. It has decent documentation, including several videos which I couldn’t bring myself to sit through, so I just dove in and started working.

Obtain html5-boilerplate from GitHub and create a branch for your site. The GitHub version is up-to-date and branching makes it easier to merge future updates. My branch is called “homepage”.

git clone https://github.com/h5bp/html5-boilerplate.git
cd html5-boilerplate
git checkout -b homepage

The main files you’ll modify for a simple static site are `index.html`, `css/style.css`, all the favicon-y images, and `humans.txt`. For `index.html` be sure to add your information to the title and description tags and modify the Google Analytics id section.

The CSS is based on progressive enhancement, where the default CSS is minimal and targeted at a low-resolution and bandwidth mobile device. This is the first section of CSS in `style.css`, and then for larger-screen devices (presumably with more bandwidth and processing power) you later introduce extra styling in the `@media` sections.

My site has a top-to-bottom linear layout of all items by default. As the browser window width increases, an image is floated alongside the text and we accept the extra overhead of loading a background image tile. I chose the background from the site Subtle Patterns and the color palette with the help of Color Scheme Designer.

When the site is ready to go, use the ant build script to generate the web-optimized site. This will create a `publish/` directory that is ready to be placed on the server. On Ubuntu, the build required the Oracle JDK, rather than the OpenJDK that comes installed by default, to avoid seeing an error message.

I wanted to manage deployment via git, so that I would have a fast way to roll back changes that didn’t require re-building the site. I created a git repository in the `publish` directory (it is `.gitignore`‘d by the parent project) and pushed it to my remote server. On the server, I cloned the repository directly into my web directory. The html5-boilerplate `.htaccess` prevents access to dotfiles, so the `.git` information is safe. I use the hash from the development repository branch as the commit message in the publish repository.

Unfortunately, by default, the html5-boilerplate build script deletes the `publish` directory, wiping out the git publishing repository. To fix this, I replaced the line in `build.xml`:

<delete dir="./${dir.publish}/"/>

with:

<delete dir="./${dir.publish}/" includeemptydirs="true" defaultexcludes="false">
    <exclude name=".git/**" />
</delete>

The process for updating my webpage is:

- Make changes to the development branch
- `ant minify`
- local: `git add/commit/push`
- remote: `git pull`

srt(X,Y):- permutation(X,Y), isSrt(Y).
isSrt([]).
isSrt([_]).
isSrt([H,I|J]):- H=<I, isSrt([I|J]).

The beauty of this solution is that Prolog does all the dirty work of finding a sorted `Y` for our given `X`. The darker side is that Prolog’s satisfaction mechanism carries out a brute force search through all possible permutations of the given list. This is dangerous; compare the runtimes of our naive sort and the Prolog built-in sort (which is decidedly less naive):
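For comparison, the same generate-and-test idea rendered in Python (my sketch, not from the post):

```python
from itertools import permutations

def srt(x):
    """Permutation sort: search all orderings of x for a sorted one."""
    return next(list(p) for p in permutations(x)
                if all(a <= b for a, b in zip(p, p[1:])))

print(srt([3, 1, 2]))  # [1, 2, 3] -- but O(n!) comparisons in the worst case
```

The brute-force character of the search is the same as Prolog's; only the notation differs.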

%5 seconds vs. something < 0.01 seconds
reverse([1,2,3,4,5,6,7,8,9,10],X), profile(srt(X,Y)).
reverse([1,2,3,4,5,6,7,8,9,10],X), profile(msort(X,Y)).

Not great, and notice how fast the worst case time of O(n!) grows:

%55 seconds vs. something < 0.01 seconds
reverse([1,2,3,4,5,6,7,8,9,10,11],X), profile(srt(X,Y)).
reverse([1,2,3,4,5,6,7,8,9,10,11],X), profile(msort(X,Y)).

This is to be expected; if we want to be picky about how the job gets done, we can be more explicit in our description. So let’s ignore performance and see how to write a declarative “permutation” predicate from scratch.

Skip this section unless you’re interested in how a simple idea is translated into a complex and unattractive Prolog solution.

Abandoning all performance considerations for a kind of extreme descriptive minimalism, I considered the following definition for a permutation of a list: each item in the given list is in the permutation and each item in the permutation is in the list. As Prolog searches for all possible lists that satisfy this rule, it should produce all permutations of the given list.

I started by trying to define a `membr/2` predicate such that it would succeed when all items in the first list are in the second list. Attempts at extending this into something that would produce permutations caused infinite loops and stack overflows until I settled on this definition:

membr([H|T],Y):- length(Y,L), length(T,M), M<L, member(H,Y), membr(T,Y).
membr([],Y):- allNonVar(Y).

allNonVar([]).
allNonVar([H|T]):- nonvar(H), allNonVar(T).

srt(X,Y):- membr(X,Y), isSrt(Y), !.

Where `member/2` is a built-in predicate with a straightforward Prolog implementation.

The clauses dealing with length prevent `len(Y) < len(X)` if `X` contains duplicates. The predicate `allNonVar/1` prevents `Y` from having extra, un-unified anonymous variables. These can be left unbound when `Y` is forced to be a certain length, but duplicates in `X` always match to the first occurrence of the duplicate value in `Y`. These can also be left unbound when the `member/2` goal posits extra anonymous variables; `membr(X,[Y|_junk1,_junk2])` would otherwise be true.

Finally, the cut at the end of `srt/2` prevents the consideration of equivalent permutations of the input list when it contains duplicates.

The implementation of a simple definition of permutation ended up looking rather ugly because it had to fight the Prolog search mechanism.

A prettier implementation of a permutation predicate is given in Clocksin and Mellish, *Programming in Prolog*, Springer, 2003:

perm([],[]).
perm(L,[H|T]):- append(V,[H|U],L), append(V,U,W), perm(W,T).

This states that a permutation of L is something that has as its head an element H drawn from L, and which is followed by something that is a permutation of the remaining elements of L. Most of the permutation definitions on the web are built like this one, but I like its use of `append/3`.

This blog post sketches how QuickSelect and QuickMedian work in general and in my Java implementation. The bulk of the post discusses the plot of QuickMedian vs. SortMedian’s runtime shown below.

QuickSelect uses the partitioning step of QuickSort to divide an array into two sections around a pivot. All elements smaller than the pivot move before the pivot and all elements larger than the pivot move after the pivot. A new pivot is chosen *on the side of the partition that will contain the N-th element*, and the partitioning step is run again. It turns out that the number of elements visited on successive iterations of partitioning decreases so quickly in the average case that the overall runtime is expected O(n), although the worst case is O(n^2).
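A minimal Python sketch of the idea (my illustration, using throwaway lists for clarity rather than the in-place partitioning of the Java implementation):

```python
import random

def quickselect(a, n):
    """Return the n-th smallest element (0-based) of a, in expected O(len(a))."""
    a = list(a)
    lo, hi = 0, len(a) - 1
    while True:
        pivot = a[random.randint(lo, hi)]
        # three-way partition of the active region around the pivot
        lt = [x for x in a[lo:hi + 1] if x < pivot]
        eq = [x for x in a[lo:hi + 1] if x == pivot]
        gt = [x for x in a[lo:hi + 1] if x > pivot]
        a[lo:hi + 1] = lt + eq + gt
        if n < lo + len(lt):
            hi = lo + len(lt) - 1            # n-th element is left of the pivot
        elif n < lo + len(lt) + len(eq):
            return pivot                     # n-th element equals the pivot
        else:
            lo = lo + len(lt) + len(eq)      # n-th element is right of the pivot
```

Each iteration discards the partition that cannot contain the N-th element, which is why the expected work shrinks so quickly.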

QuickMedian was originally intended to be a demonstration of the QuickSelect algorithm. It turns out to be surprisingly tricky to write a robust median function! Much of the trouble comes from handling even-length arrays in the case of large or infinite values. When an array is of even length and the middle two values have large magnitudes, we need an averaging formula that avoids overflow. When the middle two elements are infinity, in some cases we must return `NaN`.

Presenting arrays containing `NaN` to QuickSelect produces undefined behavior, which seems like the correct behavior. What seems like incorrect behavior is present in Numpy, where `NaN` sorts before other values. Here, the median value of a list can be decreased by inserting things which are not even numbers! Incorrect behavior may also result from a median calculation that uses Python’s native list `sort()` method. Here, `NaN` is not ordered with respect to other numbers, so the median of an array can change from call to call!

The final problem faced by QuickMedian was one of inefficiency. Even-length arrays require two calls to QuickSelect to find the middle two elements. This is inefficient if implemented naively, because the two calls operate independently over the full array. It should not be the case that runtimes vary greatly between arrays of length 1,000,000 and 1,000,001. To improve performance, QuickSelect was refactored to return the array indexes defining the smallest partition found that contains the N-th element. This partition is usually very small, and so the second QuickSelect call over this sub-array has negligible cost.
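The even-length edge cases can be sketched like this (my Python illustration; `sorted` stands in for the two QuickSelect calls to keep it short, and the averaging logic is my assumption in the spirit of the post, not its Java code):

```python
import math

def quickmedian(a):
    """Median with overflow-safe averaging for even-length input."""
    n = len(a)
    b = sorted(a)  # stand-in for the two QuickSelect calls
    if n % 2 == 1:
        return b[n // 2]
    lo, hi = b[n // 2 - 1], b[n // 2]
    if lo == hi:
        return lo                      # also covers equal infinities
    if math.isinf(lo) and math.isinf(hi):
        return float('nan')            # middle pair is (-inf, +inf)
    if (lo < 0.0) != (hi < 0.0):
        return lo / 2.0 + hi / 2.0     # mixed signs: hi - lo could overflow
    return lo + (hi - lo) / 2.0        # same sign: lo + hi could overflow
```

Note that the naive `(lo + hi) / 2` overflows for two large same-sign values, which is exactly the trouble described above.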

I wanted to examine the performance of QuickMedian, prove that it worked better than the sorting median method, and analyze the effects of different pivot picking methods. The wall clock time was accumulated over the course of several randomized trials, divided and sorted carefully, then plotted.

The evaluation consisted of 100 repetitions of trials in which several conditions were exhaustively varied. In each trial, the median methods processed a total of 1e6 elements, which were divided up into arrays of constant length. This number of elements was used because processing just one array of length 1000 was too quick to measure using the wall clock system time on my machine. The conditions which were varied across trials were:

- Length of arrays into which the 1e6 elements were divided
- Even or odd-length arrays
- Sorting of the arrays: random, sorted, reverse-sorted, mostly-sorted, mostly-reverse-sorted
- QuickSelect pivot-picking method: middle, random, median-of-three, median-of-randomized-three
- Whether the arrays contained duplicates

Array element values were consecutive integers; each integer was repeated under the “duplicate” condition. The mostly-sorted arrays were sorted but for the very middle 10%. Trials were completely randomized in a paranoid effort to eliminate JVM runtime optimizations.

The runtimes were divided into series and sorted in such a way as to show interesting systemic effects.

Sorting by array length addressed the main question of computational complexity and runtime speed. The QuickMedian runtimes remain roughly constant, while the SortMedian runtimes increase. Remember that the 1e6 elements in each trial are divided into k arrays, each of length n, such that k·n = 1e6. QuickMedian has an expected runtime of O(n) per array, which is O(k·n) = O(1e6) overall and thus expected to be constant across trials. SortMedian has an expected runtime of O(n log n) per array, which is O(k·n·log n) = O(1e6 · log n) overall, which is expected to increase as the array length increases! The array lengths grow exponentially (something like 1e2, 1e3, 1e4, and 1e5), so the runtimes increase linearly. This, and the fact that QuickMedian’s runtimes lie well below most of the SortMedian runtimes, is perhaps the central finding of the evaluation.

Then, splitting by the most interesting intra-algorithm condition showed further systemic effects. The runtimes from the QuickMedian algorithm were divided into series based upon the pivot picking method used, while the SortMedian runtimes were divided by how the array was initially sorted.

The SortMedian results show that my platform’s Java sorting algorithm works fastest on reverse-sorted arrays. The ripples in these series are interestingly due to the presence or absence of duplicates.

The QuickMedian results confirm that the fastest runtime is given by the deterministic pivot picking methods on sorted arrays, where the middle element is almost immediately found. I was surprised to see that the randomized median-of-three pivot picking method never outperforms the deterministic median-of-three pivot picking method. However, it makes sense because median-of-three killer sequences are probably unlikely. I haven’t looked into this behavior yet, but I am assuming that the randomized median-of-three pivot picking method is more resistant to naturally-occurring killer sequences, so it will remain the default pivot picking method in QuickMedian.

QuickMedian is clearly the fastest way to find the median, and should be used particularly when the length of the input arrays is 100 or greater. If throughput performance is important, you are operating on safe and mostly-sorted data, and the possibility of a worst-case O(n^2) runtime won’t dissuade you: consider switching to the non-randomized median-of-three pivot picking method when constructing the QuickMedian object.

A farmer was planting to reap,

Wished to profit the max on the cheap.

Trade-off soil, spacing, and blight,

Planting time, market value, and light,

But: too-complex his goals were to meet.

“Perhaps I can ask my dear friends,

To plant blindly what I recommends.

Select crops and place them by chance,

Grow, harvest, and then sell those plants.

The free market enforces the trends!

Some will make it and some will not,

Dupe success in the fallow lot,

But to all I’ll suggest

To mingle, and for zest,

To randomly swap partial plots.

Farmhands are a hard working breed.

They plant and they tend and they weed.

While it may be rarer,

A job done in error,

Begats a season novelty’d.

Perhaps not me nor my son,

Nor his daughter nor boss ADM,

But eventually one day,

The min work for max pay

Will pervade, then we’re finally done!”

The one-liner below is overkill for this task, but the extra arguments provide a starting point for modification for related tasks. The use of `find`/`xargs` should help the one-liner deal with very large numbers of audio files and filenames that contain whitespace.

find . -maxdepth 1 -name '*.wav' -type f -print0 | xargs -0 -t -r -I {} sox {} -r 8000 8000Hz/{}

An explanation of commands and flags from left-to-right:

- `find .` – searches the current directory subtree
- `-maxdepth 1` – excludes subdirectories
- `-name '*.wav' -type f` – outputs files that match the glob `*.wav`
- `-print0` – first step in handling filenames with spaces
- `xargs` – executes a command given the output of `find`
- `-0` – second step in handling filenames with spaces
- `-t` – prints on stdout the command that is to be executed
- `-r` – runs the SoX command iff `find` output is generated
- `-I {}` – specifies that the pattern “{}” is replaced by the output of `find`
- `sox {} -r 8000 8000Hz/{}` – is the resampling command

I also generated a new, full set of adaptation prompt data from the CMU ARCTIC prompts.

First download and build SphinxBase. Since I had a relatively new Ubuntu install, I had to `sudo apt-get install autoconf libtool bison`.

svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/sphinxbase
cd sphinxbase
./autogen.sh
./configure
make
make check
sudo make install
sudo ldconfig -v

Thank you to Mnemonic Place for figuring out the last step (with `ldconfig`). Without this, you’ll see errors when trying to run the `sphinx_fe` command later.

svn co https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/SphinxTrain
cd SphinxTrain
./configure
make

Record/obtain the 20 Arctic WAV files that are used in the CMU howto. Because the Communicator acoustic model is 8kHz, I recorded the WAVs as 8kHz, 16-bit in Audacity and used the “multiple export” function to batch save. I `rename`’d them to match the format that is compatible with their howto, e.g., “arctic_0001.wav”. (Not quite as much fun as the Dave Barry passage that you get to read during Dragon’s adaptation process :] )
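If you don’t have `rename` handy, the batch rename can be sketched with a plain shell loop. This assumes, hypothetically, that Audacity’s multiple export produced files named `track01.wav` through `track20.wav`; adjust the glob to your actual export names.

```shell
# Hypothetical export names track01.wav, track02.wav, ...; adjust the glob
# to whatever Audacity actually produced for you.
i=1
for f in track*.wav; do
  [ -e "$f" ] || continue            # glob matched nothing; skip
  mv "$f" "$(printf 'arctic_%04d.wav' "$i")"
  i=$((i + 1))
done
```

The `%04d` format gives the zero-padded numbering (“arctic_0001.wav”) that the howto’s `arctic20.fileids` expects.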

Download the four howto training files from the CMU wiki page. One of the files is listed as “arctic20.fileids” but saves as “arctic20.listoffiles”; I just renamed it to “arctic20.fileids” so as to not cause confusion with later scripts and instructions. If you need more than 20 prompts for adaptation, you may like to draw them from the full set of CMU ARCTIC prompts I generated.

Now move the four training files, your 20 arctic WAV files, and the acoustic model you’re going to use into one directory. My acoustic model is in a subdirectory named `Communicator_40.cd_cont_4000/`. In my case, that one directory is a sibling of the sphinxbase and SphinxTrain directories.

sphinx_fe -argfile Communicator_40.cd_cont_4000/feat.params \
    -samprate 8000 -c arctic20.fileids -di . -do . \
    -ei wav -eo mfc -mswav yes
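A small sanity check of my own (not from the CMU howto): after `sphinx_fe` finishes, the number of `.mfc` feature files produced should equal the number of entries in `arctic20.fileids`.

```shell
# Compare fileids entries against produced .mfc feature files; the two
# counts printed should match if feature extraction succeeded.
expected=$(wc -l < arctic20.fileids)
actual=$(ls *.mfc 2>/dev/null | wc -l)
echo "fileids entries: $expected, mfc files: $actual"
```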

First copy over binaries from SphinxTrain:

cp ../SphinxTrain/bin.i686-pc-linux-gnu/bw .
cp ../SphinxTrain/bin.i686-pc-linux-gnu/map_adapt .
cp ../SphinxTrain/bin.i686-pc-linux-gnu/mk_s2sendump .

Find a copy of the “fillerdict” you use and copy it to the acoustic model directory, renaming it to the filename “noisedict”.

Run the Baum-Welch program:

./bw \
    -hmmdir Communicator_40.cd_cont_4000 \
    -moddeffn Communicator_40.cd_cont_4000/mdef \
    -ts2cbfn .cont. \
    -feat 1s_c_d_dd \
    -cmn current \
    -agc none \
    -dictfn arctic20.dic \
    -ctlfn arctic20.fileids \
    -lsnfn arctic20.transcription \
    -accumdir .

Note that MLLR isn’t supported in Sphinx 4, so skip this step. “The Hieroglyph” document makes a point that you can iteratively build the mllr_matrix by running `bw`, then `mllr_solve`, then `bw`, then `mllr_solve`, etc. If you’re using a system that can take advantage of the mllr_matrix, you can run `bw` after you have your final mllr_matrix to generate MLLR-adapted statistics for the MAP adaptation described later.
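That alternation can be sketched as a small shell loop. `mllr_iterate` below is a hypothetical helper of my own, not part of SphinxTrain; it skips `-mllrmat` on the first pass, since the matrix does not exist yet.

```shell
# Hypothetical helper: alternate a bw-style command and an mllr_solve-style
# command n times, skipping -mllrmat on the first pass (no matrix yet).
mllr_iterate() {
  n=$1; bw_cmd=$2; solve_cmd=$3
  i=0
  while [ "$i" -lt "$n" ]; do
    if [ -f mllr_matrix ]; then
      $bw_cmd -mllrmat mllr_matrix   # re-estimate stats under current transform
    else
      $bw_cmd                        # first pass: no transform yet
    fi
    $solve_cmd                       # writes/updates mllr_matrix
    i=$((i + 1))
  done
}
```

Real usage would pass the full `bw` and `mllr_solve` invocations shown in this post (flags and all) as the two command arguments.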

cp ../SphinxTrain/bin.i686-pc-linux-gnu/mllr_solve .
./mllr_solve \
    -meanfn Communicator_40.cd_cont_4000/means \
    -varfn Communicator_40.cd_cont_4000/variances \
    -outmllrfn mllr_matrix -accumdir .

When the mllr_matrix exists, you can re-calculate statistics as mentioned above with the command:

./bw \
    -hmmdir Communicator_40.cd_cont_4000 \
    -moddeffn Communicator_40.cd_cont_4000/mdef \
    -ts2cbfn .cont. \
    -feat 1s_c_d_dd \
    -cmn current \
    -agc none \
    -mllrmat mllr_matrix \
    -dictfn arctic20.dic \
    -ctlfn arctic20.fileids \
    -lsnfn arctic20.transcription \
    -accumdir .

Sphinx4 can benefit from MAP training, but it is labor-intensive at the moment. MAP training requires accurate transcripts of the adaptation prompts, and may degrade performance if the wrong dictionary pronunciation is used for an audio recording. For example, there are two pronunciations for “a” in the cmudict: “uh” and “ay”. Unless the transcription is annotated with the correct variant (“A” or “A(2)”), the MAP training can degrade the acoustic models involving that phoneme.
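As a hedged illustration (a made-up line, not from the ARCTIC set), a variant-annotated entry in a Sphinx transcription file might look like this, with the `(2)` picking the “ay” pronunciation over the default “uh”:

```
<s> IT IS A(2) FINE DAY </s> (arctic_0099)
```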

The suggested solution to this problem is either to perform forced alignment or to hand-transcribe the data. I haven’t tried any forced-alignment methods yet. Human annotators could use a file included in my version of the ARCTIC prompts: it lists all alternative pronunciations inline in the transcription file. This is slow going and error-prone, but it may be useful if you’re trying to adapt an acoustic model for personal use.

In any case, the commands to perform the MAP adaptation are below.

cp ../SphinxTrain/bin.i686-pc-linux-gnu/map_adapt .
cp -r Communicator_40.cd_cont_4000/ Communicator_40.cd_cont_4000.adapted
./map_adapt \
    -meanfn Communicator_40.cd_cont_4000/means \
    -varfn Communicator_40.cd_cont_4000/variances \
    -mixwfn Communicator_40.cd_cont_4000/mixture_weights \
    -tmatfn Communicator_40.cd_cont_4000/transition_matrices \
    -accumdir . \
    -mapmeanfn Communicator_40.cd_cont_4000.adapted/means \
    -mapvarfn Communicator_40.cd_cont_4000.adapted/variances \
    -mapmixwfn Communicator_40.cd_cont_4000.adapted/mixture_weights \
    -maptmatfn Communicator_40.cd_cont_4000.adapted/transition_matrices

And that’s it! All of these commands should terminate within about one second.

To test whether or not these commands actually did anything, I generated some new/old recognition results. Test data were the WAV files used for adaptation, and I used the 5K NVP 3-gram ARPA language model available from Keith Vertanen’s site. The acoustic models were the original Communicator model and the MAP-adapted model (but I didn’t use the MLLR transform).

This doesn’t tell us anything definite about the performance of the acoustic model in our application, but it shows that the adaptation did do something. Below are the first few recognition results: the first line is using the original acoustic model (OLD), the second line is using the MAP-adapted acoustic model (NEW), and the third line is the gold-standard transcription (GLD).

OLD: all circuit egypt around philip still successor

NEW: author of the danger trail philip feels it set or a

GLD: author of the danger trail, philip steels et cetera

OLD: not efficiency killer case tom unfair work

NEW: not at this particular case thomas politics with more

GLD: not at this particular case tom apologized whittemore

OLD: further twentieth time at evening into mexican

NEW: for the twentieth time that evening the two men sugar

GLD: for the twentieth time that evening the two men shook hands

A document called “The Hieroglyph” talks a bit more about adaptation and makes a few good points about the number of utterances and the care with which they are transcribed. Some suggestions from that document were incorporated into this post.

Good luck!
