Monday, October 1, 2012

Does the Code Work? Building Confidence in Mathematical Models

Suppose you have built a spreadsheet valuing a billion-dollar company for a potential merger or acquisition deal. How can you validate the numbers you get from it?

Suppose you have written a computer program supporting a Ph.D. dissertation or scientific journal article. Again, how can you validate the results you get from it?

Last time, in An Evolving Ecosystem Game, we built an artificial-life ecosystem, or agent-based model, in which agents (creatures) evolved strategies of cooperating or fighting over multiple generations. And there too, we asked: does the code work the way we intended?

All three of the examples share a common thread: the calculation is sufficiently complex that it is not obvious what the right answer is, while at the same time, getting the right answer is central to what you are trying to accomplish, and your job/reputation may be on the line. Moreover, there is no "answer in the back of the book" that you can refer to: you may be the first person ever to do this specific calculation!

It is even worse if you are only managing the person who actually wrote the code/performed the calculations. Depending on your level of involvement and background, you might have only a high-level understanding of the goals of the calculation, rather than a detailed knowledge of how it should be carried out, or you may be uncertain about the technical competence of the person doing the work.

Today I will try to give some suggestions on how to approach these kinds of issues, based on my past experience in each of the above situations.

There are really two distinct questions:
  • Model Choice: Did I translate the real world situation into an appropriate mathematical model?
  • Model Validation: Did I correctly implement the mathematical model in a computer program?
Except in toy situations, no model will correctly capture all the nuances of real physical, biological, ecological, business or social situations. You always have to make approximations and leave out details. This has led to a plethora of modeling approaches, each focused on different aspects of a situation. For instance, System Dynamics focuses on change over time, whereas Game Theory focuses on strategic choices. A real situation may have elements of both; part of the art of modeling is having enough experience with a variety of approaches to be able to select an appropriate one given the time, money and other constraints at hand.

A word of advice: be wary of analysts who only know one method. When your only tool is a hammer, all problems look like nails. The master mechanic has a whole garage full of different kinds and sizes of tools, because she knows that "the right tool for the right job" makes everything so much more successful.

The Model Choice and Validation questions are closely related. Some validation failures turn out to stem from a poor choice of mathematical model, rather than a bug in the code. Conversely, some mathematical frameworks are much easier to implement correctly in code than others. And peer review (discussion with a knowledgeable third party) helps to address both questions as well.

However, for today's article, I will assume that you have already chosen an appropriate mathematical model, and are mainly trying to validate its computer implementation. In the future I may do an article comparing several different modeling approaches in more detail.

There are a lot of potential ways to address the validation question, some of which may or may not be possible depending on the situation (e.g. constraints on time, money, or authorship). Today we will discuss three:
  • Re-implement from scratch: Have someone else do their own calculations independently, or write a second implementation of the calculation yourself, from a clean sheet, deliberately approaching the problem from a different angle.
  • Perform confidence-building tests on the existing model.
  • Write a "white paper" explaining the model and documenting the assumptions as clearly as possible, as if you needed to teach someone else how to do it.

Re-implement from scratch

Detecting errors in spreadsheets (and other kinds of computer programs) is difficult but important. Panko (1998, 2008) summarizes studies by a variety of authors highlighting the error rate in actual spreadsheets. For instance, he cites a 1998 study by the giant accounting firm KPMG in which 91% of 22 audited spreadsheets contained "significant" errors.

When a lot of money is at stake, it makes a great deal of sense to have someone else make their own version of the calculation, independently, to see if the results are similar. Depending on the situation, this might be a friend, colleague, assistant, your boss, or an outside consultant. The key is that the person have sufficient expertise to do a credible job without a lot of input from you. It serves no purpose if they work hand in hand with you developing a clone of your original model (spreadsheet, source code, whatever). You need to communicate to them the essential background: the data and assumptions that go into the calculation - but not the specifics of how to implement or perform the calculation.

This approach can be very powerful, but can also be expensive in time and money. It is most applicable when a lot of money is riding on the answer. For the billion-dollar M&A deal, I had a team of MBAs from Chicago developing an Excel spreadsheet model while I developed my own version independently in a programming language similar to Python. Every so often we compared results and talked about why they might be so far apart. These discussions revealed all sorts of hidden assumptions, for example about the timing of certain events or how best to model some uncertain future cash flows. Both models benefited from these discussions. Eventually the models gave results that were similar enough that we could feel confident they both reflected a shared understanding of the future situation of the company, and we used them as key inputs in the subsequent negotiations.

When time, money, or authorship requirements prohibit using a second person (e.g. when writing your Ph.D. thesis), another possibility is to create a second implementation yourself, as independently as possible. For instance, you might write it in a different computer language. Or you might use the same language, but redesign the architecture of the program. This is not nearly as wasteful as it often sounds to managers; anyone who has developed a complex piece of code knows that in real projects, the scope constantly creeps, so the final code winds up a patchwork of additions not part of the original plan. I have found that rewriting the thing from scratch takes very little time compared to writing the original version, and allows building a much more comprehensive and robust architecture based on all the things you learned the first time around. The resulting program will typically have fewer bugs, offer more features, and run faster than the original. If it also gives similar answers, that boosts the credibility of both versions considerably.
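
To make the idea of comparing two independent implementations concrete, here is a minimal sketch in Python. The NPV example, the function names, and the tolerance are illustrative inventions of mine, not the actual deal models; the point is simply that the two versions are deliberately written differently and then checked against each other:

import math

def npv_loop(cashflows, rate):
  # Discount each year's cash flow explicitly and add them up.
  return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

def npv_backward(cashflows, rate):
  # The same quantity computed a different way: fold from the last year backward.
  value = 0.0
  for cf in reversed(cashflows):
    value = cf + value / (1 + rate)
  return value  # like npv_loop, treats the first cash flow as occurring at time zero

cashflows = [-100.0, 30.0, 40.0, 50.0, 20.0]  # made-up project cash flows
a = npv_loop(cashflows, 0.10)
b = npv_backward(cashflows, 0.10)
print(a, b)
assert math.isclose(a, b, rel_tol=1e-9), "the two implementations disagree - investigate!"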

Perform Confidence-Building Tests

Whether you have one version of the code or two, whether you wrote it yourself or hired someone else to write it, ultimately there is no substitute for testing it. There is something of an art to designing appropriate tests. It helps to have previous experience, so that you know what you are looking for. On the other hand, you also need to keep an open mind: if you will only accept a model that reproduces your going-in expectations, then there was no point in building it - there is no chance you will learn something new. Thus many useful tests involve looking for patterns of results that are consistent (or inconsistent) with the underlying topic, even when you do not know exactly what to expect.

There are a great many possible tests you can try, and in general you will invent many more yourself, specific to your situation. Nonetheless, some things are almost universally worth trying, including:
  • Extreme Value Testing
  • Sensitivity Testing
  • Testing Isolated Portions By Hand

Extreme Value Testing

My favorite kind of test is the extreme value test. Set a particular parameter in your model to a very unusual value, and see if the output makes sense. It is almost always the case that in the extreme, you know what to expect. Depending on the parameter, an "extreme" value may be a very large positive value, a very large negative value, or something very close to zero. The goal is to see if you can break the model by supplying values outside the range of "typical" ones; if you can, it raises red flags about the math model or the implementation.

For example: suppose we are modeling the future operations of a company, to see how much it is worth. We are uncertain about the attractiveness of its future products to consumers, given the competition in the market. There will doubtless be a parameter in the model that explicitly or implicitly controls this attractiveness; see what happens if you raise or lower it a lot. Ultimately, the value of the company should fall dramatically if no one wants the product, and should rise significantly if everyone wants it. But the curve should be S-shaped: you can never have negative sales, and you can never sell to more than 100% of the market (unless your product is so amazing that it also grows aggregate demand), which limits both the potential upside and the potential downside. If the model produces negative sales, or if revenue hits a trillion dollars, you know that something is not right. This does not prove that the model produces wrong answers when you use more realistic inputs, but it certainly raises concerns.
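
This kind of check is easy to automate. Here is a minimal sketch of an extreme value sweep; the toy valuation function, the parameter name, and the market size are hypothetical stand-ins I made up for illustration, not the real model:

import math

MARKET_SIZE = 1000000  # total units the market could conceivably buy (made up)

def unit_sales(attractiveness):
  # Toy S-shaped response: the share of the market captured is a logistic
  # function of an abstract "attractiveness" score centered at zero.
  share = 1.0 / (1.0 + math.exp(-attractiveness))
  return MARKET_SIZE * share

# Extreme value test: push the parameter far outside its "typical" range
# and check that the output never leaves the plausible region.
for a in (-100, -10, -1, 0, 1, 10, 100):
  s = unit_sales(a)
  assert 0 <= s <= MARKET_SIZE, "sales escaped the plausible range!"
  print(a, round(s))

If the assert ever fires, you have found either a hole in the model or a bug in the code - exactly the kind of red flag this test is designed to raise.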

Moreover, it will be relatively easy to look at the equations inside the model and determine why it produced such a wrong answer, precisely because the answer is extreme. When you plug in "ordinary" inputs and get "ordinary" looking outputs, it is much harder to tell just by looking at them if something is wrong.

Following on this example, you can create your own extreme value tests. Just ask yourself, "what would happen if such and such an extreme event occurred?" These tests quickly demonstrate holes in the logic of the model - places where it does not match up with reality. Whenever possible, such holes should be filled, not by forcing the results to be more realistic, but by addressing the logic gap.

To take a highly oversimplified example: suppose you model sales S as a function of price P by a linear equation like
S = (50-P)*V
where V is a positive constant. The idea is that lowering price will raise sales, and raising price will drive customers away. But raising price above 50 results in negative sales. That's bad. Trying to fix it with a patch like
S = max(0, (50-P)*V)
misses the point. While it may prevent the symptom (negative sales are chopped off at zero), it does not fix the root cause (an overly simplistic model of sales depending linearly on price). Moreover, by masking the symptom, what you are really doing is making it harder for someone to discover other flaws. For instance, a related flaw in linear models like this one is that the "elasticity" (sensitivity of customers to price) grows too large too quickly as price rises, even before you reach P=50. The negative sales are just the tip of the iceberg. A much better solution is to use a more realistic model of sales, starting with a simple logit choice model
S = V*exp(-a*P)/(1+exp(-a*P))
and potentially ending with something much more sophisticated, like a "mixed logit" model calibrated to market research conducted on a large sample of potential consumers. Ideally, such a model will include the competition and their prices, as well as competitive response, which is the notion that as you modify your price, the competition may to some extent also modify theirs, and vice versa.
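
To see the difference in behavior, here is a quick sketch comparing the two equations above. The constants V=100 and a=0.1 are arbitrary values chosen for illustration, not calibrated to anything:

import math

V = 100.0  # scale constant (illustrative)
a = 0.1    # price sensitivity for the logit model (illustrative)

def sales_linear(P):
  return (50 - P) * V

def sales_logit(P):
  return V * math.exp(-a * P) / (1 + math.exp(-a * P))

for P in (0, 10, 50, 60, 100):
  print(P, sales_linear(P), round(sales_logit(P), 1))

The linear model goes negative as soon as P exceeds 50, while the logit model stays between 0 and V no matter how extreme the price - exactly the kind of behavior an extreme value test would expose.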

Sensitivity Testing

It is certainly also useful to conduct a sensitivity test. This is where you vary every input parameter up and down by a small amount, such as 10% (or by some modest fraction of their reasonable range). You then sort the parameters according to the size of the impact of the change on the output metric of interest, e.g. NPV or market share or whatever. Then - and this is where the art comes in - you look at the sorted list and try to decide if something is wrong.

Things can go wrong in several ways. You should generally know the correct sign of the effect for every parameter, and have a rough sense ("business judgment") for which factors will play a major role and which a minor one. Incorrect signs are clear errors in the model. So are parameters that should be important but turn out to have very little impact, and vice versa.

To once again use a highly oversimplified example: a simple equation for profit starts from revenue and deducts expenses and investment costs: V*P-V*C-I, where P is the sale price per item, V the volume (number of items sold), I the investment (the cost to build the factory that makes the items), and C the unit cost (material and labor per item). But suppose you inadvertently switched C and I, resulting in the formula V*P-V*I-C.

Now, you conduct a sensitivity test. Suppose in the "base case", P=3, V=10, C=1, I=2, so the base case profit is 9 according to the incorrect model.

Now we do a sensitivity analysis. Raising P by 10% to 3.3 raises profit by 3, to 12. Raising V by 10% to 11 raises profit by 1, to 10. Raising C by 10% to 1.1 lowers profit by 0.1, to 8.9. Raising I by 10% to 2.2 lowers profit by 2, to 7.

Now we sort them into a table by impact:

Factor  Change  Impact  Impact/Change
P       0.3      3       10
V       1        1        1
C       0.1     -0.1     -1
I       0.2     -2      -10

All of the signs look correct, but since the investment is a one-time cost, its impact should be one-for-one on the profit. In contrast, the per-unit cost C gets multiplied by volume, so its impact should be ten-to-one. Thus, we see immediately that something is wrong with C and I. Of course, after we fix the formula, we need to repeat all the tests, since everything is now completely different (e.g. the base case profit is actually 18, not 9).
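
The procedure is easy to mechanize. The sketch below simply codes up the toy example - the flawed formula, the 10% bumps, and the sort by impact - so the whole table can be produced at once:

base = {'P': 3.0, 'V': 10.0, 'C': 1.0, 'I': 2.0}

def profit_flawed(P, V, C, I):
  return V*P - V*I - C  # C and I accidentally swapped

def profit_correct(P, V, C, I):
  return V*P - V*C - I

def sensitivity(model, params, bump=0.10):
  base_value = model(**params)
  rows = []
  for name in params:
    tweaked = dict(params)
    tweaked[name] *= (1 + bump)
    change = tweaked[name] - params[name]
    impact = model(**tweaked) - base_value
    rows.append((name, change, impact, impact / change))
  # sort by impact, largest first, to mimic the table above
  return sorted(rows, key=lambda row: row[2], reverse=True)

for row in sensitivity(profit_flawed, base):
  print(row)  # reproduces the table above, up to floating-point rounding
print(profit_correct(**base))  # 18.0, the correct base-case profit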

Notice how much easier this error would have been to spot with an extreme value test. Drop price to match the per-unit cost: then profit should be the negative of the investment (-I), i.e. we lose the investment since we make zero incremental profit on each unit sold. But in fact, setting P=1 in the flawed model produces a profit of -11, much worse than the -2 we expected: bingo.
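
In code, that check takes only a couple of lines, using the same toy profit formulas as above (repeated here so the snippet stands on its own):

def profit_correct(P, V, C, I):
  return V*P - V*C - I

def profit_flawed(P, V, C, I):
  return V*P - V*I - C  # C and I swapped

print(profit_correct(P=1, V=10, C=1, I=2))  # -2: we lose exactly the investment
print(profit_flawed(P=1, V=10, C=1, I=2))   # -11: far from the expected -2, so something is wrong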

That's why I like to start with extreme value tests. But sensitivity testing definitely has its place. For instance, if it shows that a parameter has zero impact, you know something is wrong: you forgot to implement an important relationship, or, worse yet, someone entered a "0*" in a formula to temporarily suppress a relationship during the debugging phase, and forgot to remove it. I've seen it happen more than once over the years.

Testing Isolated Portions By Hand

A third method of testing that is extremely useful in debugging code is to carve out little pieces of the program to test on their own, and work out with paper and pencil what the correct answer should be.

I will illustrate this method in some detail on the agent-based simulation model from last time, in which a population of agents who could cooperate or fight evolved over time based on survival of the fittest. You can skip ahead now to the section on Writing a White Paper if you do not want to see the details; otherwise, you may want to review An Evolving Ecosystem Game, since we will, by necessity, delve deeply into the details of how the code works.

Recall that in the "base case" (the version we made last time), most of the agents evolve to have r near one (blue), and s near zero (red). This meant that most of the agents adopted a strategy of fighting most of the time, regardless of their recent history of wins and losses. Is this plausible? We fundamentally do not have a lot of intuition here, since the results are likely to depend quite a lot on the specific payoff matrix (the "Game" table), on the frequency of mutations, on the arrangement of agents on the grid, and so forth.

How then can we build some confidence in the simulation?

Well, as mentioned above, one way would be to find someone else who has built or will build a similar simulation, and see what they come up with. There is an academic field of Evolutionary Game Theory, and it may well be that someone in it has already written a paper on simulations of this sort. Similarly, we could write a white paper - such as this pair of blog articles - to help document the approach and clarify how the code works.

However, the most interesting approach will be to run some tests.

One good place to start is by testing the various functions in the code individually, to make sure they do what we want. The score and breed functions lend themselves to this approach. Load the code from last time and run it. Close the graphics window to stop it. Now, in the Python command shell window, we can type
score('f', 'c')
and verify that it prints 2, matching the score from the "you fight, they cooperate" line in the Game table. This works because the code (and the simulation data) are still loaded into Python and available for us to examine. It is good to check all 4 cases, either by typing them all out, or using nested loops to do it for us:
choices = ['c', 'f']
for ch in choices:
  for ach in choices:
    print("%s+%s -> %s\n" % (ch, ach, score(ch, ach)))

Next, let's see what happens if we call the breed function 1000 times on the same arguments, such as 0.2 and 0.4. It is supposed to usually produce 0.2, 0.3 or 0.4 (the min, average, or max of the two inputs), and occasionally something random (a mutation).
out = {}
for i in range(0,1000):
  z = breed(0.2, 0.4)
  if(z not in out):
    out[z] = 1
  else:
    out[z] += 1
print(out)
We find a handful of singletons (individual random values spread anywhere from zero to one), as well as 337 cases of 0.2, 329 cases of 0.4, and 324 cases of 0.3. That looks good.
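
If you want to turn that eyeball check into something automatic, a rough version might look like the snippet below. Like the snippets above, it assumes the simulation code is still loaded in the Python shell; the 280 threshold is an arbitrary cushion of mine that presumes the mutation rate is small:

samples = [breed(0.2, 0.4) for i in range(0,1000)]
for expected in (0.2, 0.3, 0.4):
  # each of min, average, and max should show up roughly a third of the time
  n = sum(1 for z in samples if abs(z - expected) < 1e-9)
  assert n > 280, "breed() produced %.1f only %d times out of 1000" % (expected, n)
print("breed() looks reasonable")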

Working with Agents is trickier, since they communicate with each other through the global variable sim in order to find opponents. We cannot just call a function in isolation. Nevertheless, we can do something similar. It is quite handy that we already ran the simulation, so we have "live" agents still sitting around that we can examine. Note: if you try this at home, you will get entirely different results, depending on when you stop the simulation by closing the graphics window, since the agents are constantly evolving, so view the following discussion as an example of the thought process, rather than as something you can reproduce exactly.

We pick an agent, call it 'A', and an opponent for it, call it 'B', and print the internal structure of A and B using
A = sim[12][34]
B = A.opponent()
print(B.i, B.j, B.uniqueId)
print(A.i, A.j, A.uniqueId)
print(B.g)
print(B.prevOutcome)
print(B.prevChoice)
The extensive output (not shown here) tells me that B is at (i,j)=(11,33), so B is indeed one of the 8 cells neighboring A, because A is at (12,34). Also, B has the following gene values: p=0.60, q=0.43, r=0.40, s=0.07. And, last time B battled with A, B chose to fight, and received a payoff of -1. Also, all 8 of B's neighbors show up with past choices and payoffs, so we know the opponent function is working.

Now we are in a position to test the Agent choice function, which is at the heart of the simulation. Let B make 10,000 choices against A, and see what happens. Since B memorizes its new choice (in the last line of the function), we have to save and restore the old value each time around the loop, to avoid contaminating the random trials:
out = {}
save = B.prevChoice[A.uniqueId]
for i in range(0,10000):
  z = B.choice(A)
  B.prevChoice[A.uniqueId] = save
  if(z not in out):
    out[z] = 1
  else:
    out[z] += 1
print(out)
We get 37% 'cooperate' and 63% 'fight' outcomes. Is this correct? Well, we have to walk through the logic. In r=40% of cases, we are myopic, with s=7% cooperate and 93% fight. In the other 60% of cases, we are a memorizer, whose previous outcome was -1 after 'f'. Since we lost, in q=43% of cases we repeat (fight), and in 57% we flip (to cooperate). Totaling these up, we get 0.4*0.93+0.6*0.43 = 63% fight. It matches!
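
If you prefer, you can let Python do that arithmetic for you. This just re-expresses the hand calculation, using B's genes as printed earlier and assuming the choice logic works exactly as described:

r, s, q = 0.40, 0.07, 0.43       # B's genes, read off the earlier printout
p_fight = r*(1 - s) + (1 - r)*q  # myopic fighters, plus memorizers repeating after a loss
print(p_fight)                   # roughly 0.63, matching the 63% observed above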

Finally, let's check an actual battle between A and B.
print(B.score, B.prevOutcome, B.prevChoice)
print(A.score, A.prevOutcome, A.prevChoice)
B.battle(A)
print(B.score, B.prevOutcome, B.prevChoice)
print(A.score, A.prevOutcome, A.prevChoice)
When I tried this, agent A had just been updated, so it started out with score 10 and no recorded history. Agent B lost 2 points and A gained 2, so B must have selected cooperation and A must have fought. And indeed, that is what their subsequent history shows. So, again, the results at the micro-scale match what we expected.

We could keep going, but you get the idea.

This kind of thing can be painfully tedious, but it is also essential. Though I consider myself highly skilled at computer programming, I found and fixed both a typo and a conceptual error in my original code by doing exactly this sort of detailed analysis back when I wrote the previous post. Donald Knuth, the famous Stanford professor of computer science and author of The Art of Computer Programming series of books (which I highly recommend as reference books - they are very dense but full of fascinating detail), often offered a monetary reward to anyone who found a new bug in one of his published programs. I won't go that far, but if you do find a remaining bug in this evolutionary game code, please do leave a comment so I can fix it.

The bottom line is that bugs do happen, in everyone's code - even in spreadsheets - and the larger and messier the code or the spreadsheet, the more often bugs happen. Testing is essential, and if you cannot test the entire program, at least work out some key steps by hand and verify that the code reproduces them.

Write a White Paper

Back to our high-level list of things you can do to improve the odds that your program is correct. One of my favorites is to write a white paper.

By a "white paper", I mean a detailed description of the problem, thedata sources, the constraints and other assumptions, the solutionalgorithm, the code itself, the validation tests, and aninterpretation of the resulting output. Much more than mere"documentation", or a "user guide", or even a "tutorial", what I havein mind here is text that walks you through the whole problem, fromstart to finish, with the aim of teaching the underlying ideas.

It always amazes me how useful this method is. The mere act of sitting down and writing out, in English, what you intended, what you did, and what you learned, is enormously helpful in clarifying your own understanding. In order to properly teach something, you must first make it your own, and understand it from the ground up. So often, I see people apply mathematical ideas by rote, not really understanding why they follow certain steps. While "cookbook" solutions can be helpful starting points, real problems of substance have enough unique features that you need to know why things are done a certain way - and when to deviate. Writing a white paper should force you to confront those steps you are uncertain about and determine whether they are correct or how to do them correctly. The paper is also a great place to report the confidence-building tests you performed - and the act of writing it will inspire you to think of yet other tests. I highly recommend the process!

I hope you found these ideas useful. You might also want to read How Do We Know?, for some related thoughts. As usual, please post questions, comments and other suggestions, or email me directly at the address described at the end of the Welcome post. Remember you can sign up for email alerts about new posts by entering your address in the widget on the sidebar. Alternatively, if you like Twitter, follow @ingThruMath to get a 'tweet' for each new post. About the Blog has some additional pointers for newcomers, who may also want to look at the Contents page for a complete list of previous articles. See you next time!
