Estimating German Tank Production During WWII … with Power Query?
Last night I attended my first Veterans in Data Science & Machine Learning (vetsindatascience.com) meeting. It was pretty interesting. There was a presentation about a Google GenAI product that was neat. We also talked about a World War II era paper that discussed how Allies estimated German Tank production during the war. Guess which topic I was interested in?
We went over the background of the paper, which in basic terms described how mathematicians were able to estimate production based on serial numbers of certain parts of a German tank. Initially, intelligence reports said more than 1000 tanks a month were being produced. Mathematicians were able to estimate (correctly) that the actual number was probably well short of that.
The presenter was able to show how this worked with some Python script. While it was a very simple probabilistic method to reach the estimate, it still gave more or less a good gauge in terms of what the actual result was, given a sample of the population.
After the meeting, someone posted the following video from Youtube that explains this really well. Here is that video:
Of course, I tried to figure this out in Power Query.
First Step: Generate Random Numbers
First order of business was generating random numbers with Power Query. I’ve seen a number of tutorials on how to do this, but I wanted to try and figure one out that utilized List.Random(), since the second optional parameter is a seed. If you are familiar with Numpy or similar libraries in R, then you would know that a seed is a number you can use to generate the random list and later on be able to reproduce it.
One issue that presented itself was that at times the list would be less than the sample we were looking for. So if I did a random sample of 5 items in a list of 25, at times the function would return a list of 4 items. I attempted to iterate with a scope parameter, but with larger samples it grew increasingly less efficient. So, I feel like it may be easier to just generate two lists and union with the function as opposed to creating an inefficient loop.
Here is what I came up with:
(sample as number, seed as number, max as number)=>
let
// how many zeros do we need based on max value
len = Text.Length(Text.From(max))-1,
// generate zeros as text
zeros = Text.Repeat("0",len),
// append after a 1 and then connvert to number
places = Number.FromText("1" & zeros),
// generate random numbers
generate =
// Select numbers less than or equal to max
List.Select(
// Sort ascending
List.Sort(
// Get distinct numbers
List.Distinct(
// Generate random numbers and transform
List.Transform(
List.Random(sample, seed),
// Convert to integers
each Number.Round(_ * places,0)
)
),
Order.Ascending
),
each _ <= max
)
in
generate
Second Step: Tank Estimation
Next step was to replicate the formula in the video and run against a sample. That was more straightforward:
(observations as list)=>
let
max = List.Max(observations),
k = List.Count(observations),
estimate = max + (max-k)/k,
rounded = Number.Round(estimate,0)
in
rounded
I generated the sample list with a sample of 25 out of 1000 possible and a seed of 1945:
Next I rank the formula against that list:
I thought it was pretty accurate. If you interpret the first list as being a sample of tanks out of 1000 tanks produced, you can see that the final number is only off by six.
Conclusion
It was fun to take apart the math (in a basic manner) behind sampling and the estimation. Overall, the math still makes a lot of sense in situations where you don’t have a lot to go off of. Will I use it a lot? Maybe not. But I may take a crack at sampling again.
Comments are closed.