Dan Dot Blog

Based on a true story

Modeling Reality

I had started this long post on data, open source development, and greater availability of public records, but it was sounding like I usually do when trying to forcibly relate 10 different themes I have playing through my brain. I’m sure none of you want to hear my sophomoric newspaperman’s pitch of a subject that, while terribly exciting to me, I can hardly claim any authority on. So instead I’m going to start with my experiences, and a challenge I’m having right now in choosing between a life of exploratory vs. confirmatory science.

My whole life I have really enjoyed structure. When I’m learning something new, I like to try to shake out the underlying concepts then stretch them to their breaking points in an attempt to better understand the deep structure that governs the thing I’m learning about. This approach requires a good combination of both deduction and empiricism. I start with a reasonable postulate handed down by an authority, draw a timid analogy to some perhaps related speck of information, then take one end of my new meme and run as fast as I can in one direction, testing its elasticity and predictive power until it snaps and I am whiplashed back towards my starting point, whereupon I pick up a new rubber idea and run out in a new direction. This approach relies on reasoning and flashes of insight to get anywhere. I love explaining things, even when I’m terribly unsure. Ask me anything and I’ll probably give you an answer, even if my knowledge on the subject is extraordinarily limited. Rather than admit ignorance, I treat such inquiries as invitations to ponder, to explore mental space, to simulate, make wild predictions, and arrive at some semblance of truth that is scarcely plausible.

This is what excited and frustrated me about psychology for so long. Whenever preparing talks or presentations, I would be slapped down for speculating about my results, for drawing deep inferences in a shallow pool of data. For me to be excited about my data, it had to make sense, even if the story that explained it was far-fetched or otherwise ill-supported. These games helped me to organize my knowledge in useful ways. Sure, I was gaining relatively little truth, but my new thoughts weren’t pure noise. Underneath the noise were kernels of truth, and the noise itself told me something about my thinking. I found that this approach really helped me learn new material. Even tenuous connections to far-flung data give more attachment points, keeping the idea from floating out of my head like a hot air balloon. It seemed that connections were what mattered, not whether they made sense. That would all get sorted out in time.

Then I started playing with and thinking about computational statistics. As computational power increases, our time-saving, truth-seeking heuristics are obviated. The deductive forces that guide and ration our precious mental resources become more harmful than helpful. As I read a lot of science fiction about a computational singularity, our old ways of reasoning seemed to make less and less sense. Random walks through understanding started to be a lot more appealing. All of the cleverness, innovation, and gestalt insight that went into optimizing those heuristics was made obsolete by the promise of being able to brute force anything.

In my computational statistics class, we learned about Monte Carlo integration. Wikipedia can probably do a better job of explaining it than me, but I’m going to try. If you want to skip my attempt at explanation, I’ll blockquote it so you can just jump over it.

The idea is really neat and beautifully inverts the historic relationship between statistics and mathematics (especially probability). In statistics, when you want to predict the behavior of a random variable, you magically arrive at some of its fundamental properties, namely its probability density function (pdf). From that, you simply integrate the pdf multiplied by the function of the random variable whose behavior you’re interested in [take f(x)=x for an easy example], and voila, you have the expectation of that function of the random variable. This all works great on a chalkboard, but in reality, integration is hard, even to approximate, when dealing with functions (typically from the pdf) that are at all unusual, and in practice, pdfs of even vanilla random variables, like a standard normal random variable, can be unfathomably complicated.
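For reference, here is that recipe in symbols. The notation is an assumption made for illustration: X is the random variable, p is its pdf, f is the function you care about, and the second formula is the textbook pdf of a standard normal.

```latex
\mathbb{E}[f(X)] = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx
\qquad
\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}
```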

It’s enough to make you grimace. But why not turn that frown upside-down, and the whole process while you’re at it? Let’s say you start with a really hard integral. What if you could rewrite the thing you’re integrating as a function of a random variable multiplied by the pdf of that random variable? Well, then its integral would be equal to the expectation of that function of the random variable, which you can’t really know unless your approach has resulted in a boring random variable (in which case somebody else has probably already done everything interesting there is to do with it). But you can take a guess at its expectation. Let’s take another simple example. Imagine you want to integrate something unpleasant that easily falls apart as f(x) times the pdf of a standard normal. All you’d need to do is get some observations of a standard normal variable (you can get close if you just ask everybody around you their height then standardize the results), apply f to each data point, then find the mean, which is not a terribly bad estimator of the expectation you’re after (and is typically the best “unbiased” estimator of it).
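Here is a rough sketch of that recipe in Python. It assumes, purely for illustration, that the unpleasant integral is cos(x) times the standard normal pdf, and it uses the standard library’s random number generator instead of asking people their heights. The name mc_expectation, the sample size, and the seed are all made up for the example.

```python
import math
import random

def mc_expectation(f, n_samples=100_000, seed=0):
    """Estimate E[f(Z)] for Z ~ N(0, 1) by averaging f over simulated draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # one observation of a standard normal
        total += f(z)            # apply the function to the data point
    return total / n_samples     # the sample mean is the estimate

# The integral of cos(x) * (standard normal pdf) has the known value exp(-1/2),
# which makes it a handy check on the simulation.
estimate = mc_expectation(math.cos)
exact = math.exp(-0.5)
print(f"estimate={estimate:.4f}  exact={exact:.4f}")
```

The two printed numbers should agree to a couple of decimal places, and more samples tighten the agreement.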

If you’re still reading and care what an unbiased estimator is, it’s simply an estimator whose own expectation is the thing you’re trying to estimate. Crazy, huh? Sometimes, the best estimator is a biased one. Imagine you have a fair die. It’s a normal die, except I’ve written “one billion” on the side where the one should go. Imagine you only get two rolls, and you have to figure out the average outcome of a roll. You could do your two rolls, take the average, and call it a day. But what if in your two rolls you just so happened to not turn up “one billion”? Your estimate is going to be way off! And if you do turn it up, you overshoot by hundreds of millions. What if, on the other hand, you just decided, without even rolling, that your estimate will be “one billion divided by six”? Doesn’t seem very empirical; indeed, it seems like you’re biased before you even started (you’ve ignored the other five sides), but the second version of you is going to be closer every time.
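With only two rolls there are just 36 equally likely outcomes, so a claim like this can be checked by brute force instead of taken on faith. The sketch below assumes the die faces are one billion, 2, 3, 4, 5, and 6, and it uses a fixed guess of one billion divided by six, which is a number chosen here for illustration. The function name is hypothetical.

```python
from itertools import product

# Assumed faces: a fair six-sided die with "one billion" in place of the one.
faces = [1_000_000_000, 2, 3, 4, 5, 6]
true_mean = sum(faces) / len(faces)

def fraction_fixed_guess_closer(fixed_guess):
    """Fraction of the 36 two-roll outcomes where a fixed guess beats the sample mean."""
    outcomes = list(product(faces, repeat=2))
    closer = 0
    for first, second in outcomes:
        sample_mean = (first + second) / 2  # the unbiased estimator
        if abs(fixed_guess - true_mean) < abs(sample_mean - true_mean):
            closer += 1
    return closer / len(outcomes)

biased_guess = 1_000_000_000 / 6  # decided before rolling; ignores the other faces
print(true_mean, fraction_fixed_guess_closer(biased_guess))
```

Passing a different fixed guess to the function shows how much the result depends on picking a sensible one.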

Qualms about true randomness aside, it’s not that hard to generate observations of a random variable. You can do it in Excel. Sure, the actual algorithm that produces it may not be perfectly random, but it’s pretty damn close. In order to get your initially difficult integral into something manageable, you may have had to make your function something ridiculous to leave a lame-ass pdf. But since you’re already fudging things anyway, why not just fudge the generation of the random variable? Maybe the pdf corresponds to a random variable you can’t do a reasonable job of simulating, so instead, you just simulate a random variable you do particularly enjoy, then weight each observation by how likely it is under the original pdf compared to the one you actually drew it from. If your random variable of choice is not a good approximation of the real random variable, most of your observations are not going to be worth much. But... if you can get so so so so many observations that it doesn’t matter, you don’t have to spend much time being clever with choosing a good random variable.
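This fudge is usually called importance sampling. A minimal sketch, under some loudly stated assumptions: the “real” random variable is a standard normal, the one we simulate instead is a deliberately sloppy normal with standard deviation 3, and the function is cos(x), whose true expectation under the standard normal is exp(-1/2). The names normal_pdf and importance_estimate are invented for the example.

```python
import math
import random

def normal_pdf(x, sigma=1.0):
    """Density of a zero-mean normal with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def importance_estimate(f, n_samples=200_000, proposal_sigma=3.0, seed=0):
    """Estimate E[f(Z)], Z ~ N(0, 1), using draws from N(0, proposal_sigma**2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, proposal_sigma)  # draw from the variable we enjoy simulating
        # Adjustment: how likely x is under the real pdf relative to the one we used.
        weight = normal_pdf(x) / normal_pdf(x, proposal_sigma)
        total += f(x) * weight
    return total / n_samples

is_estimate = importance_estimate(math.cos)
print(f"estimate={is_estimate:.4f}  exact={math.exp(-0.5):.4f}")
```

A proposal that is wider than the target wastes many draws on low-weight points, which is exactly the inefficiency that a large enough sample count papers over.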

Still with me? The moral of the above story is that the problems with Monte Carlo techniques that historically have been solved with cleverness can now be solved with brute force. If my computer is strong enough, I can bend math to its will. I can simulate anything. Let’s have a quick thought experiment. Give me the following:

  1. An immortal human. They can be mutable, but they need to be able to perform the task described below every 5 minutes forever and ever.
  2. Infinite computing power
  3. A pen and paper

Every 5 minutes, the person pauses for a second, thinks up a 15-digit random number, and writes it down. This all gets fed into a computer. Let’s take a super inter-connected view of human cognition and assert that every fiber of your being affects everything you do. Ergo, the numbers you choose are a reflection of exactly who you are at that point in time. However, there is likely another person, who is not the same as you, who, at some given point in time, might generate that same number. So, it’s not exactly one-to-one, is it? However, if you keep giving little flashes into who you are in the form of these numbers, you are going to create an infinitely complex pattern that truly is one-to-one. There is only one YOU that would generate this extremely long string of 15-digit numbers. Let’s be super unambitious and super unclever and try to come up with the algorithm that you’re fundamentally using to generate numbers. Let’s try to write your brain in Visual Basic, taking the chimpanzees-on-a-typewriter approach.

Let a computer randomly write code, run it, and see what it gets. If it matches what you have produced so far, it’s pretty close. Once it starts predicting what you’re going to do in the future, it’s even closer. Sure, it’s going to take it a lot of tries to get it right, but don’t forget that you gave me infinite computing power in number 2. Let’s get more meta and let it also develop algorithms to evaluate if it’s getting closer or farther from being right rather than just stumbling around in the darkness. Let’s step back even further and let it write algorithms that evaluate those algorithms, AD NAUSEAM!

All of a sudden, it’s theoretically possible that it’s going to model not just your number generating algorithm, but you, and how is a perfect model of you any different than the real you? Let’s blow our minds even further and say that it can model the whole damn universe that led to your being created and agreeing to the stupid rules of this stupid thought experiment.

*pause*

Now, while I may be able to find number 3, I’m not likely to come across 1 or 2 anytime soon, but it’s kind of creepy when pushed to the limits. All of a sudden, data is so much more important than any sort of clever insights we may make into it. I was initially terrified by the idea of this self-organizing computer, getting smarter at making itself smarter and running simulations of anything conceivable. I smirked to myself at presentations I attended while people tried to explain data using kitschy, home-brewed theories. Even perfectly reasonable ideas started to seem shaky. Why should people die when deprived of oxygen? That’s a handy notion, but there’s a far more complicated structure at play underneath, something unfathomably complex and beyond our articulation. This reared its ugly head even more so in psychology, where we do factor analysis and then come up with cute names for scales based on how the items feel like they hang together. Sure, before we trust somebody to do this they have to spend years wading in the literature and learning what their predecessors have thought, but isn’t it all just alchemy as people simplify beautifully complex structure into feel-good aphorisms that can be explained in a few sentences?

I was really bummed about the capacity of the human brain. Our little notions of the world were handy for keeping us alive, but ultimately didn’t even begin to scratch the surface of reality. But then, while walking about glumly, trying to wrestle with this problem semantically and deductively (which is kind of ironic, I suppose), I came to some peace.

Our articulated knowledge is an attempt to express this more complex structure in some simplified rule, but our behavior doesn’t always follow our declarative knowledge. When you ask me to explain why something works the way it does, I’m going to give you the best estimator I can lazily produce, but it may be a pretty biased one. But ask me to bet money on what number is going to come up on the die, and suddenly I’m playing a more complex game.

This is something that goldfish can do but humans struggle with. If we’re flipping a coin, and I tell you I’ll give you a dollar every time you’re right, and take a dollar every time you’re wrong, even if I tell you that it is an unfair coin that is manufactured to come up heads 55% of the time and tails 45% of the time, you’re probably not going to adopt the best strategy, which would be to just trust me and call it heads every time. You may be able to explain to me the mathematical proof that argues for you behaving like that, but in your actions you refuse to believe that the system is that simple. You’re gathering additional data like the position of my hand, wind speed, rotational inertia, and trying to somehow build this much more complicated model of how the coin really behaves. Because nothing is really that simple. Coins don’t follow the rules that we set for them. We don’t do the “right” thing in every situation, because in reality, that is unknowable! Sure, our strategy is sub-optimal if the rules really are that simple, but they aren’t. Our brains are doing all the self-organization and meta-organization that I feared computers could do. We’ve developed all of these meta meta meta algorithms that govern how we develop new algorithms, with clever little heuristics and razors, nifty principles like parsimony, not because they’re true, but because they (probably) guide us towards building a better internal model of the universe.
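To put numbers on the coin game above, here is a small sketch of the expected winnings per flip if the rules really are that simple. The comparison strategy, calling heads only 55% of the time to mirror the coin, is an assumption added here as a stand-in for a player who refuses to commit; the function name is made up.

```python
def expected_payoff(call_heads_prob, p_heads=0.55):
    """Expected dollars per flip: +1 for a correct call, -1 for a wrong one.

    call_heads_prob is how often the player calls heads, independently of the coin.
    """
    p_correct = call_heads_prob * p_heads + (1 - call_heads_prob) * (1 - p_heads)
    return p_correct * 1 + (1 - p_correct) * (-1)

always_heads = expected_payoff(1.0)       # trust the stated odds every time
mirror_the_coin = expected_payoff(0.55)   # assumed "hedging" strategy
print(f"always heads: {always_heads:.2f}  mirror the coin: {mirror_the_coin:.2f}")
```

Under these assumptions, always calling heads earns about ten cents a flip while the hedging player earns about one cent.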

Because at the end of the day, that’s all we’re doing our whole lives. We’re taking in sensory data and trying to make sense of it, trying to create some sort of internal representation of the universe. I’m sure some law of thermodynamics or the uncertainty principle or some other vaguely invoked rule of physics would argue that something inside of a system can’t possibly make total sense of that system, but we can sure try. So I’ll keep telling my far-fetched stories about why things are the way they are but with the added wisdom that while I’m probably not right, that doesn’t mean the lies are without value.

Editor’s Note: This is really long and ungainly and I’m super impressed if you made it this far. After writing it I’m not even willing to reread it, at least not immediately, so rather than sit on this post like I usually do, I’m just going to truck it out, in all its ugliness, and pick at it and clean it up and springboard off of it in the future.

February 16, 2010 Posted by | Academic, Personal | 3 Comments

Statistics Woo!

I hope in my next life I get to come back as a statistician. I guess there’s nothing stopping me from doing that now, and maybe it’s just the Monster talking, but right now there seems to be nothing more interesting to me than regression, link functions, and ordinal regression. There’s just this awesome beauty in using math backwards that requires creativity and insight. This really started to click for me when I learned about Monte Carlo integration. How clever to realize that, as computing power grew, instead of using integration to figure out how random variables would behave, we can instead use their simulated behavior to do integration! How many other mathematical things can we turn on their heads? When will we use chemistry to understand addition, or physics to understand differentiation?

December 10, 2009 Posted by | Academic, Personal