
Friday Cat Blogging (Kara vs. Nietzsche Issue of Non-Science-Geek Edition)


But enough about the single worst book I read last year. Here, in honor of National Black Cat Appreciation Day, is Kara helping me sort through relics from graduate school:

Cat vs. Nietzsche

Kara feels that section 207, on how the spirit of objectivity, and the type of person who cultivates it, have merely instrumental value ("a mirror" in "the hand of one more powerful") is alright as far as it goes, but it misses the more fundamental truth, that all human beings have merely instrumental value — in the service of cats.

Friday Cat Blogging


Class Announcement: 36-350, Statistical Computing


It's that time of year again:

36-350, Statistical Computing
Description: Computational data analysis is an essential part of modern statistics. Competent statisticians must not just be able to run existing programs, but to understand the principles on which they work. They must also be able to read, modify and write code, so that they can assemble the computational tools needed to solve their data-analysis problems, rather than distorting problems to fit tools provided by others. This class is an introduction to programming, targeted at statistics majors with minimal programming knowledge, which will give them the skills to grasp how statistical software works, tweak it to suit their needs, recombine existing pieces of code, and when needed create their own programs.
Students will learn the core ideas of programming — functions, objects, data structures, flow control, input and output, debugging, logical design and abstraction — through writing code to assist in numerical and graphical statistical analyses. Students will in particular learn how to write maintainable code, and to test code for correctness. They will then learn how to set up stochastic simulations, how to parallelize data analyses, how to employ numerical optimization algorithms and diagnose their limitations, and how to work with and filter large data sets. Since code is also an important form of communication among scientists, students will learn how to comment and organize code.
The class will be taught in the R language.
Pre-requisites: This is an introduction to programming for statistics students. Prior exposure to statistical thinking, to data analysis, and to basic probability concepts is essential, as is some prior acquaintance with statistical software. Previous programming experience is not assumed, but familiarity with the computing system is. Formally, the pre-requisites are "Computing at Carnegie Mellon" (or consent of instructor), plus one of either 36-202 or 36-208, with 36-225 as either a pre-requisite (preferable) or co-requisite (if need be).
The class may be unbearably redundant for those who already know a lot about programming. The class will be utterly incomprehensible for those who do not know statistics and probability.

Further details can be found at the class website. Teaching materials (lecture slides, homeworks, labs, etc.) will appear both there and here.

— This is the second time the department has done this course. Last year, Vince Vu and I made it up, and I learned a lot from teaching it with him. This year, he's been carried to Ohio in a swarm of bees (that is, he got a job at Ohio State), so I'll have to pull it off without his help. Another change is that instead of having a (very reasonable) 26 students, I've got 42 people signed up and more on the waiting list. Only 23 of those are statistics majors, however, and I am tempted to scare the rest away (or at least, I am not sure that the others fully appreciate what they have signed up for). We'll see.

Corrupting the Young; Enigmas of Chance; Introduction to Statistical Computing

Orientation

Attention conservation notice: Posted because I got tired of repeating it to nervous new graduate students. You are not beginning graduate school at a research university. (Any resemblance to how I treat the undergrads in ADA is entirely deliberate.)

Graduate school, especially at the beginning, is an ego-destroying, even humiliating, experience. People who are used to being good at what they do get set impossible tasks by elders they look up to, and the students have their faces ground in their failures. Everything is designed to encourage self-doubt. This is not an accident. Graduate school is not just about teaching certain specific skills; it is about breaking down your old persona to make room for a new one, to turn out a certain kind of person. It has evolved so that authority figures make it very clear that the new inductees possess no accomplishments of any worth — but that if they work very hard, beyond ordinary expectation, they can emerge as the peers of the authorities, the only kind of person worthy of respect, members for life of the fraternity.

In other words: welcome to boot camp, maggots! It's a sad day for our beloved discipline when we have to take unpromising specimens like you — and yes, that includes you, I know your type — but maybe, possibly, just maybe, one or two of you might have what it takes to become scholars...

Graduate school is not, obviously, physically demanding. (For that matter, few of your instructors can pull off battle-dress and a Sam Browne belt; avoid the ones who try.) Our version of "Drop and give me fifty" is "Prove these functions are measurable" — or, perhaps more insidiously, "Really? What does the literature say about that point?" (Actual question at an oral exam in my physics department: "Explain how a radio works. You may start from Maxwell's equations.") But we are playing the same mind-games: removing you from your usual friends and associates, working you constantly, dis-orienting you with strange new concepts and vocabulary, surrounding you with people who are either in on the act or also enduring the initiation, and perpetually reinforcing that the only thing which matters is whether you excel in this particular program. This is an institution which has persisted over, literally, a thousand years by consuming young people like you and spitting out scholars with the attitudes and habits of mind it needs to perpetuate itself, and it is very good at getting inside your head.

There are many ways to cope with this, but what I would suggest is to remember, fiercely, that it's all bullshit. You are a bad-ass, whatever happens in school. It may have practical consequences, but it doesn't matter, and any impression we give to the contrary is just part of the bullshit. Hold on to that fact, internalize it, feel it, and let the stress and strain pass through you without distorting you.

You will still prove the theorems I tell you to prove, but you'll remember that this doesn't matter to who you are. If you decide to care about academia, it will be a conscious choice, and not the institution turning you into its means of reproduction.

Corrupting the Young

Basic Data Types and Data Structures (Introduction to Statistical Computing)


Introduction to the course: statistical programming for autonomy, honesty, and clarity of thought. The functional programming idea: write code by building functions to transform input data into desired outputs. Basic data types: Booleans, integers, characters, floating-point numbers. Subtleties of floating point numbers. Operators as basic functions. Variables and names. An example with resource allocation. Related pieces of data are bundled into larger objects called data structures. Most basic data structures: vectors. Some vector manipulations. Functions of vectors. Naming of vectors. Continuing the resource-allocation example. Building more complicated data structures on top of vectors. Arrays as a first vector structure.
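A minimal R sketch of the kinds of manipulations the lecture covers; the values and names below (including the little "allocation" vector) are made up for illustration and are not taken from the slides:

    # Basic data types
    is.logical(TRUE); is.numeric(7/3); is.character("sloth")
    0.1 + 0.2 == 0.3           # FALSE: floating-point arithmetic is not exact
    all.equal(0.1 + 0.2, 0.3)  # TRUE: compare with a tolerance instead

    # Vectors: creation, naming, element-wise arithmetic, functions of vectors
    allocation <- c(20, 30, 50)                  # hypothetical resource shares
    names(allocation) <- c("labor", "steel", "wheat")
    allocation / sum(allocation)                 # proportions, computed element-wise
    allocation["steel"]                          # access by name
    mean(allocation); sort(allocation, decreasing = TRUE)

    # Arrays: a vector plus dimension information
    a <- array(1:6, dim = c(2, 3))
    dim(a); a[2, 3]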

Slides

Introduction to Statistical Computing

Robins and Wasserman Respond to a Nobel Prize Winner

Attention conservation notice: 2500 words of statisticians quarreling with econometricians about arcane points of statistical theory.
Long, long ago, I tried to inveigle Larry into writing this, by promising I would make it a guest post. Larry now has his own blog, but a promise is a promise. More to the point, while I can't claim any credit for it, I'm happy to endorse it, and to push it along by reproducing it. Everything between the horizontal lines is by Jamie and Larry, though I tweaked the formatting trivially.

Robins and Wasserman Respond to a Nobel Prize Winner

James Robins and Larry Wasserman

Note: This blog post is written by two people and it is cross-posted at Normal Deviate and Three Toed Sloth.

Chris Sims is a Nobel prize winning economist who is well known for his work on macroeconomics, Bayesian statistics, and vector autoregressions, among other things. One of us (LW) had the good fortune to meet Chris at a conference and can attest that he is also a very nice guy.

Chris has a paper called On an Example of Larry Wasserman. This post is a response to Chris' paper.

The example in question is actually due to Robins and Ritov (1997). A simplified version appeared in Wasserman (2004) and Robins and Wasserman (2000). The example is related to ideas from the foundations of survey sampling (Basu 1969, Godambe and Thompson 1976) and also to ancillarity paradoxes (Brown 1990, Foster and George 1996).

1. The Model

Here is (a version of) the example. Consider iid random variables \[ (X_1,Y_1,R_1),\ldots, (X_n,Y_n,R_n). \] The random variables take values as follows: \[ X_i \in [0,1]^d,\ \ \ Y_i \in\{0,1\},\ \ \ R_i \in\{0,1\}. \] Think of \( d \) as being very, very large. For example, \( d=100,000 \) and \( n=1,000 \).

The idea is this: we observe \( X_i \). Then we flip a biased coin \( R_i \). If \( R_i=1 \) then you get to see \( Y_i \). If \( R_i=0 \) then you don't get to see \( Y_i \). The goal is to estimate \[ \psi = P(Y_i=1). \] Here are the details. The distribution takes the form \[ p(x,y,r) = p_X(x) p_{Y|X}(y|x)p_{R|X}(r|x). \] Note that \( Y \) and \( R \) are independent, given \( X \). For simplicity, we will take \( p(x) \) to be uniform on \( [0,1]^d \). Next, let \[ \theta(x) \equiv p_{Y|X}(1|x) = P(Y=1|X=x) \] where \( \theta(x) \) is a function. That is, \( \theta:[0,1]^d \rightarrow [0,1] \). Of course, \[ p_{Y|X}(0|x)= P(Y=0|X=x) = 1-\theta(x). \] Similarly, let \[ \pi(x)\equiv p_{R|X}(1|x) = P(R=1|X=x) \] where \( \pi(x) \) is a function. That is, \( \pi:[0,1]^d \rightarrow [0,1] \). Of course, \[ p_{R|X}(0|x)= P(R=0|X=x) = 1-\pi(x). \]

The function \( \pi \) is known. We construct it. Remember that \( \pi(x) = P(R=1|X=x) \) is the probability that we get to observe \( Y \) given that \( X=x \). Think of \( Y \) as something that is expensive to measure. We don't always want to measure it. So we make a random decision about whether to measure it. And we let the probability of measuring \( Y \) be a function \( \pi(x) \) of \( x \). And we get to construct this function.

Let \( \delta>0 \) be a known, small, positive number. We will assume that \[ \pi(x)\geq \delta \] for all \( x \).

The only thing in the model we don't know is the function \( \theta(x) \). Again, we will assume that \[ \delta \leq \theta(x) \leq 1-\delta. \] Let \( \Theta \) denote all measurable functions on \( [0,1]^d \) that satisfy the above conditions. The parameter space is the set of functions \( \Theta \).

Let \( {\cal P} \) be the set of joint distributions of the form \[ p(x) \, \pi(x)^r (1-\pi(x))^{1-r}\, \theta(x)^y (1-\theta(x))^{1-y} \] where \( p(x)=1 \), and \( \pi(\cdot) \) and \( \theta(\cdot) \) satisfy the conditions above. So far, we are considering the sub-model \( {\cal P}_\pi \) in which \( \pi \) is known.

The parameter of interest is \( \psi = P(Y=1) \). We can write this as \[ \psi = P(Y=1)= \int_{[0,1]^d} P(Y=1|X=x) p(x) dx = \int_{[0,1]^d} \theta(x) dx. \] Hence, \( \psi \) is a function of \( \theta \). If we know \( \theta(\cdot ) \) then we can compute \( \psi \).
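To make the setup concrete, here is a rough R sketch that simulates data from one member of this family; the particular \( \theta(x) \) and \( \pi(x) \) below are arbitrary illustrative choices (and \( d \) is kept small so the code runs quickly), not anything from Robins and Ritov:

    # Simulate n observations: X uniform on [0,1]^d, Y | X ~ Bernoulli(theta(X)),
    # R | X ~ Bernoulli(pi(X)), with Y and R independent given X.
    simulate_rr <- function(n, d, theta_fun, pi_fun) {
      X <- matrix(runif(n * d), nrow = n, ncol = d)
      theta <- apply(X, 1, theta_fun)   # P(Y = 1 | X = x)
      pi_x  <- apply(X, 1, pi_fun)      # P(R = 1 | X = x), known by design
      list(X = X, Y = rbinom(n, 1, theta), R = rbinom(n, 1, pi_x), pi_x = pi_x)
    }

    # Arbitrary choices, both bounded between delta and 1 - delta with delta = 0.1;
    # d is 100 rather than 100,000 only to keep the example small.
    delta <- 0.1
    theta_fun <- function(x) 0.1 + 0.8 * mean(x)
    pi_fun <- function(x) 0.1 + 0.8 * abs(sin(10 * sum(x)))
    dat <- simulate_rr(n = 1000, d = 100, theta_fun, pi_fun)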

2. Frequentist Analysis

The usual frequentist estimator is the Horwitz-Thompson estimator \[ \hat\psi = \frac{1}{n}\sum_{i=1}^n \frac{ Y_i R_i}{\pi(X_i)}. \] It is easy to verify that \( \hat\psi \) is unbiased and consistent. Furthermore, \( \hat\psi - \psi = O_P(n^{-\frac{1}{2}}) \). In fact, let us define \[ I_n = [\hat\psi - \epsilon_n,\ \hat\psi + \epsilon_n] \] where \[ \epsilon_n = \sqrt{\frac{1}{2n\delta^2}\log\left(\frac{2}{\alpha}\right)}. \] It follows from Hoeffding's inequality that \[ \inf_{P\in{\cal P}_\pi} P(\psi \in I_n)\geq 1-\alpha. \] Thus we have a finite-sample, \( 1-\alpha \) confidence interval with length \( O(1/\sqrt{n}) \).
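In R, the estimator and this confidence interval take only a few lines; continuing the illustrative simulation above (with that arbitrary \( \theta \), the true \( \psi \) is \( 0.1 + 0.8 \times 0.5 = 0.5 \)):

    # Horwitz-Thompson estimator with the Hoeffding-based confidence interval
    ht_estimate <- function(Y, R, pi_x, delta, alpha = 0.05) {
      psi_hat <- mean(R * Y / pi_x)   # unbiased, since E[RY/pi(X)] = E[theta(X)] = psi
      eps <- sqrt(log(2 / alpha) / (2 * length(Y) * delta^2))
      c(estimate = psi_hat, lower = psi_hat - eps, upper = psi_hat + eps)
    }
    ht_estimate(dat$Y, dat$R, dat$pi_x, delta = 0.1)

The interval is conservative (Hoeffding's inequality is crude), but its width shrinks at the advertised \( O(1/\sqrt{n}) \) rate.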

Remark: We are mentioning the Horwitz-Thompson estimator because it is simple. In practice, it has three deficiencies:

  1. It may exceed 1.
  2. It ignores data on the multivariate vector \( X \) except for the one dimensional summary \( \pi(X) \).
  3. It can be very inefficient.
These problems are remedied by using an improved version of the Horwitz-Thompson estimator. One choice is the so-called locally semiparametric efficient regression estimator (Scharfstein et al., 1999): \[ \hat\psi = \int {\rm expit}\left(\sum_{m=1}^k \hat\eta_m \phi_m(x) + \frac{\hat\omega}{\pi(x)}\right)dx \] where \( {\rm expit}(a) = e^a/(1+e^a) \), \( \phi_{m}\left( x\right) \) are basis functions, and \( \hat\eta_1,\ldots,\hat\eta_k,\hat\omega \) are the mle's (among subjects with \( R_i=1 \)) in the model \[ \log\left( \frac{P(Y=1|X=x)}{P(Y=0|X=x)}\right) = \sum_{m=1}^k \eta_m \phi_m(x) + \frac{\omega}{\pi(x)}. \] Here \( k \) can increase slowly with \( n. \) Recently even more efficient estimators have been derived. Rotnitzky et al (2012) provides a review. In the rest of this post, when we refer to the Horwitz-Thompson estimator, the reader should think ``improved Horwitz-Thompson estimator.'' End Remark.
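By way of illustration only, here is one way an estimator of this general form might be coded in R, under strong simplifying assumptions made purely for the sketch: \( d = 1 \), polynomial basis functions \( \phi_m(x) = x^m \), and Monte Carlo approximation of the integral over \( [0,1] \). It is a sketch of the general recipe, not the Scharfstein et al. estimator in full generality:

    improved_ht <- function(x, y, r, pi_fun, k = 3, n_mc = 1e5) {
      # Polynomial basis phi_m(x) = x^m, m = 1, ..., k (an arbitrary choice for this sketch)
      basis <- function(x) { b <- outer(x, 1:k, `^`); colnames(b) <- paste0("phi", 1:k); b }
      obs <- (r == 1)
      # Logistic regression among subjects with R_i = 1, with 1/pi(x) as an extra covariate;
      # glm's intercept plays the role of a constant basis function.
      fit <- glm(y ~ ., family = binomial,
                 data = data.frame(y = y[obs], basis(x[obs]), w = 1 / pi_fun(x[obs])))
      # Monte Carlo approximation of the integral of expit(...) over x ~ Uniform[0,1]
      x_new <- runif(n_mc)
      mean(predict(fit, newdata = data.frame(basis(x_new), w = 1 / pi_fun(x_new)),
                   type = "response"))
    }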

3. Bayesian Analysis

To do a Bayesian analysis, we put some prior \( W \) on \( \Theta \). Next we compute the likelihood function. The likelihood for one observation takes the form \( p(x) p(r|x) p(y|x)^r \). The reason for having \( r \) in the exponent is that, if \( r=0 \), then \( y \) is not observed so the \( p(y|x) \) gets left out. The likelihood for \( n \) observations is \[ \prod_{i=1}^n p(X_i) p(R_i|X_i) p(Y_i|X_i)^{R_i} = \prod_i \pi(X_i)^{R_i} (1-\pi(X_i))^{1-R_i}\, \theta(X_i)^{Y_i R_i} (1-\theta(X_i))^{(1-Y_i)R_i}. \] where we used the fact that \( p(x)=1 \). But remember, \( \pi(x) \) is known. In other words, \( \pi(X_i)^{R_i} (1-\pi(X_i))^{1-R_i} \) is known. So, the likelihood is \[ {\cal L} (\theta) \propto \prod_i \theta(X_i)^{Y_i R_i} (1-\theta(X_i))^{(1-Y_i)R_i}. \]

Combining this likelihood with the prior \( W \) creates a posterior distribution on \( \Theta \) which we will denote by \( W_n \). Since the parameter of interest \( \psi \) is a function of \( \theta \), the posterior \( W_n \) for \( \theta \) defines a posterior distribution for \( \psi \).

Now comes the interesting part. The likelihood has essentially no information in it.

To see that the likelihood has no information, consider a simpler case where \( \theta(x) \) is a function on \( [0,1] \). Now discretize the interval into many small bins. Let \( B \) be the number of bins. We can then replace the function \( \theta \) with a high-dimensional vector \( \theta = (\theta_1,\ldots, \theta_B) \). With \( n < B \), most bins are empty. The data contain no information for most of the \( \theta_j \)'s. (You might wonder about the effect of putting a smoothness assumption on \( \theta(\cdot ) \). We'll discuss this in Section 4.)
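A quick numerical check of the binning argument (the values of \( n \) and \( B \) are arbitrary):

    n <- 1000; B <- 10000
    bins <- cut(runif(n), breaks = seq(0, 1, length.out = B + 1))
    mean(table(bins) == 0)   # fraction of empty bins: roughly exp(-n/B), about 0.9 here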

We should point out that if \( \pi(x) = 1/2 \) for all \( x \), then Ericson (1969) showed that a certain exchangeable prior gives a posterior that, like the Horwitz-Thompson estimator, converges at rate \( O(n^{-1/2}) \). However, we are interested in the case where \( \pi(x) \) is a complex function of \( x \); then the posterior will fail to concentrate around the true value of \( \psi \). On the other hand, a flexible nonparametric prior will have a posterior essentially equal to the prior and, thus, not concentrate around \( \psi \), whenever the prior \( W \) does not depend on the known function \( \pi(\cdot) \). Indeed, we have the following theorem from Robins and Ritov (1997):

Theorem (Robins and Ritov 1997). Any estimator that is not a function of \( \pi(\cdot) \) cannot be uniformly consistent.

This means that, at no finite sample size, will an estimator \( \hat\psi \) that is not a function of \( \pi \) be close to \( \psi \) for all distributions in \( {\cal P} \). In fact, the theorem holds for a neighborhood around every pair \( (\pi,\theta) \). Uniformity is important because it links asymptotic behavior to finite sample behavior. But when \( \pi \) is known and is used in the estimator (as in the Horwitz-Thompson estimator and its improved versions) we can have uniform consistency.

Note that a Bayesian will ignore \( \pi \) since the \( \pi(X_i)'s \) are just constants in the likelihood. There is an exception: the Bayesian can make the posterior be a function of \( \pi \) by choosing a prior \( W \) that makes \( \theta(\cdot) \) depend on \( \pi(\cdot) \). But this seems very forced. Indeed, Robins and Ritov showed that, under certain conditions, any true subjective Bayesian prior \( W \) must be independent of \( \pi(\cdot) \). Specifically, they showed that once a subjective Bayesian queries the randomizer (who selected \( \pi \)) about the randomizer's reasoned opinions concerning \( \theta(\cdot) \) (but not \( \pi(\cdot) \)), the Bayesian will have independent priors. We note that a Bayesian can have independent priors even when he believes with probability 1 that \( \pi(\cdot) \) and \( \theta(\cdot) \) are positively correlated as functions of \( x \), i.e., \[ \int \theta(x) \pi(x)\, dx > \int \theta(x)\, dx \int \pi(x)\, dx. \] Having independent priors only means that learning \( \pi(\cdot) \) will not change one's beliefs about \( \theta(\cdot) \). So far, so good. As far as we know, Chris agrees with everything up to this point.

4. Some Bayesian Responses

Chris goes on to raise alternative Bayesian approaches.

The first is to define \[ Z_i = \frac{R_i Y_i}{\pi(X_i)}. \] Note that \( Z_i \in \{0\} \cup [1,\infty) \). Now we ignore (throw away) the original data. Chris shows that we can then construct a model for \( Z_i \) which results in a posterior for \( \psi \) that mimics the Horwitz-Thompson estimator. We'll comment on this below, but note two strange things. First, it is odd for a Bayesian to throw away data. Second, the new data are a function of \( \pi(X_i) \) which forces the posterior to be a function of \( \pi \). But as we noted earlier, when \( \theta \) and \( \pi \) are a priori independent, the \( \pi(X_i)'s \) do not appear in the posterior since they are known constants that drop out of the likelihood.

A second approach (not mentioned explicitly by Chris) which is related to the above idea, is to construct a prior \( W \) that depends on the known function \( \pi \). It can be shown that if the prior is chosen just right then again the posterior for \( \psi \) mimics the (improved) Horwitz-Thompson estimator.

Lastly, Chris notes that the posterior contains no information because we have not enforced any smoothness on \( \theta(x) \). Without smoothness, knowing \( \theta(x) \) does not tell you anything about \( \theta(x+\epsilon) \) (assuming the prior \( W \) does not depend on \( \pi \)).

This is true and better inferences would obtain if we used a prior that enforced smoothness. But this argument falls apart when \( d \) is large. (In fairness to Chris, he was referring to the version from Wasserman (2004) which did not invoke high dimensions.) When \( d \) is large, forcing \( \theta(x) \) to be smooth does not help unless you make it very, very, very smooth. The larger \( d \) is, the more smoothness you need to get borrowing of information across different values of \( \theta(x) \). But this introduces a huge bias which precludes uniform consistency.

5. Response to the Response

We have seen that response 3 (add smoothness conditions in the prior) doesn't work. What about response 1 and response 2? We agree that these work, in the sense that the Bayes answer has good frequentist behavior by mimicking the (improved) Horwitz-Thompson estimator.

But this is a Pyrrhic victory. If we manipulate the data to get a posterior that mimics the frequentist answer, is this really a success for Bayesian inference? Is it really Bayesian inference at all? Similarly, if we choose a carefully constructed prior just to mimic a frequentist answer, is it really Bayesian inference?

We call Bayesian inference which is carefully manipulated to force an answer with good frequentist behavior, frequentist pursuit. There is nothing wrong with it, but why bother?

If you want good frequentist properties just use the frequentist estimator. If you want to be a Bayesian, be a Bayesian but accept the fact that, in this example, your posterior will fail to concentrate around the true value.

6. Summary

In summary, we agree with Chris' analysis. But his fix is just frequentist pursuit; it is Bayesian analysis with unnatural manipulations aimed only at forcing the Bayesian answer to be the frequentist answer. This seems to us to be an admission that Bayes fails in this example.

7. References

Basu, D. (1969). Role of the Sufficiency and Likelihood Principles in Sample Survey Theory. Sankhya, 31, 441--454.

Brown, L. D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, 18, 471-493.

Ericson, W. A. (1969). Subjective Bayesian models in sampling finite populations. Journal of the Royal Statistical Society. Series B, 195-233.

Foster, D. P. and George, E. I. (1996). A simple ancillarity paradox. Scandinavian Journal of Statistics, 233-242.

Godambe, V. P., and Thompson, M. E. (1976), Philosophy of Survey-Sampling Practice. In Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, eds. W. L. Harper and A. Hooker, Dordrecht: Reidel.

Robins, J. M. and Ritov, Y. (1997). Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models. Statistics in Medicine, 16, 285--319.

Robins, J. and Wasserman, L. (2000). Conditioning, likelihood, and coherence: a review of some foundational concepts. Journal of the American Statistical Association, 95, 1340--1346.

Rotnitzky, A., Lei, Q., Sued, M. and Robins, J. M. (2012). Improved double-robust estimation in missing data and causal inference models. Biometrika, 99, 439-456.

Scharfstein, D. O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 1096-1120.

Sims, Christopher. On An Example of Larry Wasserman. Available at: http://www.princeton.edu/~sims/.

Wasserman, L. (2004). All of Statistics: a Concise Course in Statistical Inference. Springer Verlag.


Enigmas of Chance; Bayes, Anti-Bayes; Kith and Kin

More Data Structures (Introduction to Statistical Computing)


Lecture 2: Matrices as a special type of array; functions for matrix arithmetic and algebra: multiplication, transpose, determinant, inversion, solving linear systems. Using names to make calculations clearer and safer: resource-allocation mini-example. Lists for combining multiple types of values; accessing sub-lists and individual elements; ways of adding and removing parts of lists. Lists as key-value pairs. Data frames: the data structure for classic tabular data, one column per variable, one row per unit; data frames as hybrids of matrices and lists. Structures of structures: using lists recursively to create complicated objects; example with eigen.
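A few toy R commands along the lines of the lecture (the numbers and names are made up, not the slides' examples):

    # Matrices and linear algebra
    A <- matrix(c(2, 1, 1, 3), nrow = 2)
    b <- c(1, 2)
    A %*% b; t(A); det(A); solve(A); solve(A, b)   # multiply, transpose, determinant, invert, solve Ax = b

    # Lists combine values of different types; they work as key-value pairs
    run <- list(label = "pilot", n = 30, converged = TRUE)
    run$n; run[["label"]]                       # extract single elements
    run$notes <- "re-run with more iterations"  # add a component
    run$converged <- NULL                       # remove one

    # Data frames: one column per variable, one row per unit
    d <- data.frame(city = c("Pittsburgh", "Cleveland"),
                    population_thousands = c(306, 397))  # rough figures, for illustration
    d$population_thousands; d[1, ]

    # Structures of structures: eigen() returns a list containing a vector and a matrix
    e <- eigen(A)
    e$values; e$vectors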

Introduction to Statistical Computing

Rainfall, Data Structures, Obsessive Doodling (Introduction to Statistical Computing)


In which we practice working with data frames, grapple with some of the subtleties of R's system of data types, and calculate the consequences of doodling while bored in lecture.

Assignment, due at 11:59 pm on Thursday, 6 September 2012

Introduction to Statistical Computing

Lab: Basic Probability, Basic Data Structures (Introduction to Statistical Computing)


Books to Read While the Algae Grow in Your Fur, August 2012


Attention conservation notice: I have no taste.

M. S. Bartlett, Stochastic Population Models in Ecology and Epidemiology
Short; includes applications to real data.
Despite its technical obsolescence, I plan to rip it off shamelessly for examples when next I teach stochastic processes, or complexity and inference.
Patricia A. McKillip, The Bards of Bone Plain
Two interlocking stories about bards, far-separated in time, searching for secrets in poetry. Beautiful prose, as always, but for the first time that I can remember with one of McKillip's books, the ending felt rushed. Still eminently worth reading.
Lucy A. Snyder, Switchblade Goddess
Mind candy: a sorceress from Ohio continues (previous installments) her efforts to get out of Texas, but keeps being dragged back to various hells. I am deeply ambivalent about recommending it, however. The best I can do to say why, without spoilers, is that key parts of the book are at once effectively (even viscerally) narrated, and stuff I wish I'd never encountered. Mileage, as they say, varies.
Spoilers: Gur pbasyvpg orgjrra Wrffvr naq gur rcbalzbhf "fjvgpuoynqr tbqqrff" guvf gvzr vaibyirf abg whfg zvaq-tnzrf, nf va gur cerivbhf obbx, ohg ivivqyl qrfpevorq naq uvtuyl frkhnyvmrq obql ubeebe, jvgu Wrffr'f bja ernpgvbaf gb gur ivbyngvbaf bs ure obql naq zvaq orvat irel zhpu n cneg bs obgu gur frkhnyvgl naq gur ubeebe. Senaxyl, gubfr puncgref fdhvpxrq zr gur shpx bhg. V'z cerggl fher gung'f jung gubfr cnegf bs gur obbx jrer vagraqrq gb qb, fb cbvagf sbe rssrpgvir jevgvat, ohg V qvqa'g rawbl vg ng nyy. Cneg bs gung znl or gur pbagenfg gb gur guevyyvat-nqiragher gbar bs gur cerivbhf obbxf, naq rira zbfg bs guvf bar. (Znlor vs V'q tbar va rkcrpgvat ubeebe?)
Thomas W. Young, The Renegades
Mind candy; thriller about the US war in Afghanistan, drawing on the author's experience as a military pilot. I liked it — Young has a knack for effective descriptions in unflashy prose — but I am not sure if that wasn't because it played to some of my less-defensible prejudices. (Sequel to Silent Enemy and The Mullah's Storm, but self-contained.)
Lt. Col. Sir Wolseley Haig (ed.), Cambridge History of India, vol. III: Turks and Afghans
Picked up because I ran across it at the local used bookstore, and it occurred to me I knew next to nothing about what happened in India between the invasions by Mahmud of Ghazni and the conquest by Babur. What I was neglecting was that this was published in 1928...
Given over 500 pages to describe half a millennium of the life of one of the major branches of civilization, do you spend them on the daily life and customs of the people, craft and technology, commerce, science, literature, religion, administration, and the arts? Or do you rather devote them almost exclusively to wars (alternately petty and devastating), palace intrigues, rebuking the long dead for binge drinking, and biologically absurd speculations on the "degeneration" of descendants of central Asians brought on by the climates of India, with a handful of pages that mention Persianate poetry, religion (solely as an excuse for political divisions) and tax farming? Evidently if you were a British historian towards the end of the Raj, aspiring to write the definitive history of India, c. +1000 to c. +1500, the choice was clear.
— To be fair, the long final chapter, on monumental Muslim architecture during the period, is informed and informative, though still full of pronouncements about the tastes and capacities of "the Hindu"*, "the Muslim"**, "the Persian"***, "the Arab" (and "Semitic peoples")****, etc. And no doubt there are readers for whom this sort of obsessive recital of politico-military futility is actually useful, and would appreciate it being told briskly, which it is.
Recommended primarily if you want a depiction of 500 years of aggression and treachery that makes Game of Thrones seem like Jenny and the Cat Club.
*: "In the Indian architect this sense for the decorative was innate; it came to him as a legacy from the pre-Aryan races..."
**: "Elaborate decoration and brightly coloured ornament were at all times dear to the heart of the Muslim."
***: "[Persia's] genius was of the mimetic rather than the creative order, but she possessed a magic gift for absorbing the artistic creations of other countries and refining them to her own standard of perfection."
****: "With the Arabs, who in the beginning of the eighth century possessed themselves of Sind, our concern is small. Like other Semitic peoples they showed but little natural instinct for architecture or the formative arts." Not, "Our concern is small, because few of their works have survived, and they seem to have had little influence on what came later", which would have been perfectly reasonable.
Alexandre B. Tsybakov, Introduction to Nonparametric Estimation
What it says on the label. This short (~200 pp.) book is an introduction to the theory of non-parametric statistical estimation, divided, like Gaul, into three parts.
The first chapter introduces the basic problems considered: estimating a probability density function, estimating a regression function (with fixed and random placement of the input variable), and estimating a function observed through Gaussian noise. (The last of these has applications in signal processing, not discussed, and equivalences to the other problems, treated in detail.) The chapter then introduces the main methods to be used: kernel estimators, local polynomial, "projection" estimators (i.e., approximating the unknown function by a series expansion in orthogonal functions, especially but not exclusively Fourier expansions). The goal in this case is to establish upper bounds on the error of the function estimates, for different notions of error (mean-square at one point, mean-square averaged over space, maximum error, etc.). The emphasis is on finding the asymptotic rate at which these upper bounds go to zero. To achieve this, the text assumes that the unknown function lies in a space of functions which are more or less smooth, and upper-bounds how badly wrong kernels (or whatever) can go on such functions. (If you find yourself skeptically muttering "And how do I know the regression curve lies in a Sobolev \( \mathcal{S}(\beta,L) \) space1?", I would first of all ask you why assuming linearity isn't even worse, and secondly ask you to wait until the third chapter.) A typical rate here would be that the mean-squared error of kernel regression is \( O(n^{-2\beta/(2\beta+1)}) \), where \( \beta > 0 \) is a measure of the smoothness of the function class. While such upper bounds have real value, in reassuring us that we can't be doing too badly, they may leave us worrying that some other estimator, beyond the ones we've considered, would do much better.
The goal of the second chapter is to alleviate this worry, by establishing lower bounds, and showing that they match the upper bounds found in chapter 1. This is a slightly tricky business. Consider the calibrating macroeconomist Fool2 who says in his heart "The regression line is \( y = x/1600 \)", and sticks to this no matter what the data might be. In general, the Fool has horrible, \( O(1) \) error --- except when he's right, in which case his error is exactly zero. To avoid such awkwardness, we compare our non-parametric estimators to the minimax error rate, the error which would be obtained by a slightly-imaginary3 estimator designed to make its error on the worst possible function as small as possible. (What counts as "the worst possible function" depends on the estimator, of course.) The Fool is not the minimax estimator, since his worst-case error is \( O(1) \), and the upper bounds tell us we could at least get \( O(n^{-2\beta/(2\beta+1)}) \).
To get actual lower bounds, we use the correspondence between estimation and testing. Suppose we can always find two far-apart regression curves no hypothesis test could tell apart reliably. Then the expected estimation error has to be at least the testing error-rate times the distance between those hypotheses. (I'm speaking a little loosely; see the book for details.) To turn it around, if we can estimate functions very precisely, we can use our estimates to reliably test which of various near-by functions are right. Thus, invoking Neyman-Pearson theory, and various measures of distance or divergence between probability distributions, gives us fundamental lower bounds on function estimation. This reasoning can be extended to testing among more than two hypotheses, and to Fano's inequality. There is also an intriguing section, with new-to-me material, on Van Trees's inequality, which bounds Bayes risk4 in terms of integrated Fisher information.
It will not, I trust, surprise anyone that the lower bounds from Chapter 2 match the upper bounds from Chapter 1.
The rates obtained in Chapters 1 and 2 depend on the smoothness of the true function being estimated, which is unknown. It would be very annoying to have to guess this — and more than annoying to have to guess it right. An "adaptive" estimator, roughly speaking, is one which doesn't have to be told how smooth the function is, but can do (about) as well as one which was told that by an Oracle. The point of chapter 3 is to set up the machinery needed to examine adaptive estimation, and to exhibit some adaptive estimators for particular problems, mostly of the projection-estimator/series-expansion type. Unlike the first two chapters, the text of chapter 3 does not motivate itself very well, but the plot will be clear to experienced readers.
The implied reader has a firm grasp of parametric statistical inference (to the level of, say, Pitman or Casella and Berger) and of Fourier analysis, but in principle no more. There is a lot more about statistical theory than I have included in my quick sketch of the book's contents, such as the material on unbiased risk estimation, efficiency and super-efficiency, etc.; the patient reader could figure this all out from what's in Tsybakov, but either a lot of prior exposure, or a teacher, would help a lot. There is also nothing about data, or practical/computational issues (not even a mention of the curse of dimensionality!). The extensive problem sets at the end of each chapter will help with self-study, but I feel like this is really going to work best as a textbook, which is what it was written for. It would be the basis for an excellent one-semester course in advanced statistical theory, or, supplemented with practical exercises (and perhaps with All of Nonparametric Statistics), a first graduate5 class in non-parametric estimation.
1: As you know, Bob, that's the class of all functions which can be differentiated at least \( \beta \) times, and where the integral of the squared \( \beta^{\mathrm{th}} \) derivative is no more than \( L \). (Oddly, in some places Tsybakov's text has \( \beta-1 \) in place of \( \beta \), but I think the math always uses the conventional definition.) ^
2: To be clear, I'm the one introducing the character of the Fool here; Tsybakov is more dignified. ^
3: I say "slightly imaginary" because we're really taking an infimum over all estimators, and there may not be any estimator which actually attains the infimum. But "infsup" doesn't sound as good as "minimax". ^
4: Since Bayes risk integrated over a prior distribution on the unknown function, and minimax risk is risk at the single worst unknown function, Bayes risk provides a lower bound on minimax risk. ^
5: For a first undergraduate course in non-parametric estimation, you could use Simonoff's Smoothing Methods in Statistics, or even, if desperate, Advanced Data Analysis from an Elementary Point of View. ^
Peter J. Diggle and Amanda G. Chetwynd, Statistics and Scientific Method: An Introduction for Students and Researchers
I have mixed feelings about this.
Let me begin with the good things. The book's heart is very much in the right place: instead of presenting statistics as a meaningless collection of rituals, show it as a coherent body of principles, which scientific investigators can use as tools for inquiry. The intended audience is (p. ix) "first-year postgraduate students in science and technology" (i.e., what we'd call first-year graduate students), with "no prior knowledge of statistics", and no "mathematical demands... beyond a willingness to get to grips with mathematical notation... and an understanding of basic algebra". After some introductory material, a toy example of least-squares fitting, and a chapter on general ideas of probability and maximum likelihood estimation, Chapters 4--10 all cover useful statistical topics, all motivated by real data, which is used in the discussion*. The book treats regression modeling, experimental design, and dependent data all on an equal footing. Confidence intervals are emphasized over hypothesis tests, except when there is some substantive reason to want to test specific hypotheses. There is no messing about with commercial statistical software (there is a very brief but good appendix on R), and code and data are given to reproduce everything. Simulation is used to good effect, where older texts would've wasted time on exact calculations. I would much rather see scientists read this than the usual sort of "research methods" boilerplate.
On the negative side: The bit about "scientific method" in the title, chapter 1, chapter 7, and sporadically throughout, is not very good. There is no real attempt to grapple with the literature on methodology — the only philosopher cited is Popper, who gets invoked once, on p. 80. I will permit myself to quote the section where this happens in full.
7.2 Scientific Laws
Scientific laws are expressions of quantitative relationships between variables in nature that have been validated by a combination of observational and experimental evidence.

As with laws in everyday life, accepted scientific laws can be challenged over time as new evidence is acquired. The philosopher Karl Popper summarizes this by emphasizing that science progresses not by proving things, but by disproving them (Popper, 1959, p. 31). To put this another way, a scientific hypothesis must, at least in principle, be falsifiable by experiment (iron is more dense than water), whereas a personal belief need not be (Charlie Parker was a better saxophonist than John Coltrane).

7.3 Turning a Scientific Theory into a Statistical Model...

That sound you hear is pretty much every philosopher of science since Popper and Hempel, crying out from Limbo, "Have we lived and fought in vain?"

Worse: This has also got very little to do with what chapter 7 does, which is fit some regression models relating how much plants grow to how much of the pollutant glyphosate they were exposed to. The book settles on a simple linear model after some totally ad hoc transformations of the variables to make that look more plausible. I am sure that the authors — who are both statisticians of great experience and professional eminence — would not claim that this model is an actual scientific law, but they've written themselves into a corner, where they either have to pretend that it is, or be unable to explain the scientific value of their model. (On the other hand, accounts of scientific method centered on models, e.g., Ronald Giere's, have no particular difficulty here.)
Relatedly, the book curiously neglects issues of power in model-checking. Still with the example of modeling the response of plants to different concentrations of pollutants, section 7.6.8 considers whether to separately model the response depending on whether the plants were watered with distilled or tap water. This amounts to adding an extra parameter, which increases the likelihood, but by a statistically-insignificant amount (p. 97). This ignores, however, the question of whether there is enough data, precisely-enough measured, to notice a difference — i.e., the power to detect effects. Of course, a sufficiently small effect would always be insignificant, but this is why we have confidence intervals, so that we can distinguish between parameters which are precisely known to be near zero, and those about which we know squat. (Actually, using a confidence interval for the difference in slopes would fit better with the general ideas laid out here in chapter 3.) If we're going to talk about scientific method, then we need to talk about ruling out alternatives (as in, e.g., Kitcher), and so about power and severity (as in Mayo).
This brings me to lies-told-to-children. Critical values of likelihood ratio tests, under the standard asymptotic assumptions, are given in Table 3.2, for selected confidence levels and numbers of parameters. The reader is not told where these numbers come from (\( \chi^2 \) distributions), so they are given no route to figure out what to do in cases which go beyond the table. What is worse, from my point of view, is that they are given no rationale at all for where the table comes from (\( \chi^2 \) here falls out from Gaussian fluctuations of estimates around the truth, plus a second-order Taylor expansion), or why the likelihood ratio test works as it does, or even a hint that there are situations where the usual asymptotics will not apply. Throughout, confidence intervals and the like are stated based on Gaussian (or, as the book puts it, capital-N "Normal") approximations to sampling distributions, without any indication to the reader as to why this is sound, or when it might fail. (The word "bootstrap" does not appear in the index, and I don't think they use the concept at all.) Despite their good intentions, they are falling back on rituals.
Diggle and Chetwynd are very experienced both as applied statisticians and as teachers of statistics. They know better in their professional practice. I am sure that they teach their statistics students better. That they don't teach the readers of this book better is a real lost opportunity.
Disclaimer: I may turn my own data analysis notes into a book, which would to some degree compete with this one.
*: For the record: exploratory data analysis and visualization, motivated by gene expression microarrays; experimental design, motivated by agricultural and clinical field trials; comparison of means, motivated by comparing drugs; regression modeling, motivated by experiments on the effects of pollution on plant growth; survival analysis, motivated by kidney dialysis; time series, motivated by weather forecasting; and spatial statistics, motivated by air pollution monitoring.
Geoffrey Grimmett and David Stirzaker, Probability and Random Processes, 3rd edition
This is still my favorite stochastic processes textbook. My copy of the second edition, which has been with me since graduate school, is falling apart, and so I picked up a new copy at JSM, and of course began re-reading on the plane...
It's still great: it strikes a very nice balance between accessibility and mathematical seriousness. (There is just enough shown of measure-theoretic probability that students can see why it will be useful, without overwhelming situations where more elementary methods suffice.) It's extremely sound at focusing on topics which are interesting because they can be connected back to the real world, rather than being self-referential mathematical games. The problems and exercises are abundant and well-constructed, on a wide range of difficulty levels. (They are now available separately, with solutions manual, as One Thousand Exercises in Probability.)
I am very happy to see more in this edition on Monte Carlo and on stochastic calculus. (My disappointment that the latter builds towards the Black-Scholes model is irrational, since they're giving the audience what it wants.) Nothing seems to have been dropped from earlier editions.
It does have limitations. It's a book about the mathematics of probabilistic models, and so has little to say about how one designs such a model in the first place. This may be inevitable, since the tools of model-building must change with the subject matter1. There is also no systematic account here of statistical inference for stochastic processes, but this is so universal among textbooks that it's easier to describe exceptions2. If a fourth edition were to fix this, I would regard the book as perfect; instead, it is merely almost perfect.
The implied reader has a firm grasp of calculus (through multidimensional integration) and a little knowledge of linear algebra. They can also read and do proofs. No prior knowledge of probability is, strictly speaking, necessary, though it surely won't hurt. With that background, and the patience to tackle 600 pages of math, I unhesitatingly recommended this as a first book on random processes for advanced undergraduates or beginning graduate students, or for self-study.
1: E.g., tracking stocks and flows of conserved quantities, and making sure they balance, is very useful in physics and chemistry, and even some parts of biology. But it's not very useful in the social sciences, since hardly any social or economic variables of any interest are conserved. (I had never truly appreciated Galbraith's quip that "The process by which banks create money is so simple that the mind is repelled" until I tried to explain to an econophysicist that money is not, in fact, a conserved quantity.) And so on. ^
2: The best exception I've seen is Peter Guttorp's Stochastic Modeling of Scientific Data. It's a very good introduction to stochastic processes and their inference for an audience who already knows some probability, and statistics for independent data; it also talks about model-building. But it doesn't have the same large view of stochastic processes as Grimmett and Stirzaker's book, or the same clarity of exposition. Behind that, there is Bartlett's Stochastic Processes, though it's now antiquated. From a different tack, Davison's Statistical Models includes a lot on models of dependent data, but doesn't systematically go into the theory of such processes. ^

Books to Read While the Algae Grow in Your Fur; Enigmas of Chance; Scientifiction and Fantastica; Afghanistan and Central Asia; Biology; Writing for Antiquity

"Dependence Estimation in High-Dimensional Euclidean Spaces" (Next Week at the Statistics Seminar)


For the first seminar of the new academic year, we are very pleased to welcome —

Barnabas Poczos, "Dependence Estimation in High-Dimensional Euclidean Spaces"
Abstract: In this presentation we review some recent results on dependence estimation in high-dimensional Euclidean spaces. We survey several different dependence measures with their estimators and discuss the main difficulties and open problems with a special emphasis on how to avoid the curse of dimensionality. We will also propose a new dependence measure which extends the maximum mean discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon's mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example, in feature selection. The estimator is consistent, robust to outliers, and does not suffer from the curse of dimensionality. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of numerical experiments.
Time and place: 4--5 pm on Monday, 10 September 2012, in the Adamson Wing (136) of Baker Hall

As always, the seminar is free and open to the public.

Enigmas of Chance

Flow Control, Looping, Vectorizing (Introduction to Statistical Computing)


Lecture 3: Conditioning the calculation on the data: if; what is truth?; Boolean operators again; switch. Iteration to repeat similar calculations: for and iterating over a vector; while and conditional iteration (reducing for to while); repeat and unconditional iteration, with break to exit loops (reducing while to repeat). Avoiding iteration with "vectorized" operations and functions: the advantages of the whole-object view; some examples and techniques: mathematical operators and functions, ifelse; generating arrays with repetitive structure.
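A compact R illustration of these constructs (toy examples of my own, not the lecture's):

    x <- -3:3

    # Conditionals: if/else chooses between calculations based on the data
    if (any(x < 0)) {
      sign_note <- "contains negative values"
    } else {
      sign_note <- "all non-negative"
    }

    # Iteration: for over a vector; while for conditional repetition; repeat with break
    total <- 0
    for (xi in x) total <- total + xi^2
    i <- length(x)
    while (i > 0) i <- i - 1
    repeat { i <- i + 1; if (i >= length(x)) break }

    # Vectorized alternatives are usually shorter, clearer, and faster
    sum(x^2)                                   # replaces the for loop above
    ifelse(x < 0, -x, x)                       # element-wise conditional (recomputes abs(x))
    matrix(rep(c(0, 1), times = 6), nrow = 3)  # an array with repetitive structure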

Introduction to Statistical Computing

Book-Chat

Attention conservation notice: As though I don't drone on about technical books too much as it is, pointers to five reviews averaging 2,000+ words each.

Some notes on books which grew too large for the end-of-the-month wrap-ups:

The review of Cox and Donnelly originally ran in American Scientist. The reviews of Moore and Mertens, and of Bühlmann and van de Geer, were things I started last year and only just finished.

Enigmas of Chance; Automata and Calculating Machines

Pareto at Melos


Because Brad DeLong wants to revive a discussion from two years ago about "Hobbesian" tendencies in economics, I am reminded of a truly excellent paper which Brendan O'Connor told me about a few months ago:

Michele Piccione and Ariel Rubinstein, "Equilibrium in the Jungle", Economic Journal 117 (2007): 883--896 [PDF preprint]
Abstract: In the jungle, power and coercion govern the exchange of resources. We study a simple, stylized model of the jungle that mirrors an exchange economy. We define the notion of jungle equilibrium and demonstrate that a number of standard results of competitive markets hold in the jungle.

The abstract does not do this justice, so I'll quote from the introduction (their italics).

In the typical analysis of an exchange economy, agents are involved in consumption and exchange goods voluntarily when mutually beneficial. The endowments and the preferences are the primitives of the model. The distribution of consumption in society is determined endogenously through trade.

This paper is motivated by a complementary perspective on human interaction. Throughout the history of mankind, it has been quite common (and we suspect that it will remain so in the future) that economic agents, individually or collectively, use power to seize control of assets held by others. The exercise of power is pervasive in every society and takes several forms. ...

We introduce and analyse an elementary model of a society, called the jungle, in which economic transactions are governed by coercion. The jungle consists of a set of individuals, who have exogenous preferences over a bounded set of consumption bundles (their capacity to consume is finite), and of an exogenous ranking of the agents according to their strength. This ranking is unambiguous and known to all. Power, in our model, has a simple and strict meaning: a stronger agent is able to take resources from a weaker agent.

The jungle model mirrors the standard model of an exchange economy. In both models, the preferences of the agents over commodity bundles and the total endowment of goods are given. The distribution of power in the jungle is the counterpart of the distribution of endowments in the market. As the incentives to produce or collect the goods are ignored in an exchange economy, so are the incentives to build strength in the jungle.

We define a jungle equilibrium as a feasible allocation of goods such that no agent would like and is able to take goods held by a weaker agent. We demonstrate several properties that equilibrium allocation of the jungle shares with the equilibrium allocation of an exchange economy. In particular, we will show that under standard assumptions a jungle equilibrium exists and is Pareto efficient. In addition, we will show that there exist prices that 'almost' support the jungle equilibrium as a competitive equilibrium.

Appreciating this requires a little background. There are multiple arguments to be made on behalf of the market system. The one which the mainstream of the discipline of economics likes to emphasize, and to teach, is the "first fundamental theorem of welfare economics". Assume some obviously false conditions about what people want, and still more obvious falsehoods about the institutions they have. (There need to be competitive markets in literally everything anyone might want, for instance.) Then let people trade with each other if they want to, and only if they want to. The market comes to equilibrium when no one wants to trade any more. This equilibrium is a situation where nobody can be made better off (by their own lights) without someone else being made worse off. That is, the equilibrium is "Pareto optimal" or "Pareto efficient". (Actually, the theory almost never considers the actual dynamics of the exchange, for good reasons; it just shows that every equilibrium is Pareto optimal*.) This theorem gets invoked a lot by serious members of the profession in their writings and teaching. I will leave supporting citations and quotations for this point to Mark Blaug's "The Fundamental Theorems of Modern Welfare Economics, Historically Contemplated" (History of Political Economy 39 (2007): 185--207). Suffice it to say that many, I suspect most, economists take this to be a strong argument for why market systems are better than alternatives, why market outcomes should be presumed to be good, etc.

A conventional assault on this is to argue that the result is not robust — since we know that the premises of the theorems are quite false, those theorems don't say anything about the virtues of actually-existing markets. Then one has mathematical questions about whether the results still hold if the assumptions are relaxed (generically, no), and empirical questions about processes of production, consumption and distribution.

What Piccione and Rubinstein have done is quite different. They have shown that the same optimality claims can be made on behalf of the most morally objectionable way of allocating resources. "The jungle" takes the horrible message of the Melian dialogue, that "the strong do what they can and the weak suffer what they must", and turns it into the sole basis of social intercourse**. And yet everything which general equilibrium theory says in favor of the Utopian market system is also true of the rule of force. In this hybrid of Hobbes and Leibniz, the state of nature is both just as repugnant as Hobbes said, and yet also the best of all possible worlds.

Piccione and Rubinstein's paper undermines the usual economists' argument from Pareto optimality, because it shows not just one or two horrible Pareto optima (those are very easy to come up with), but a horrible system which is nonetheless Pareto-efficient. Of course, there are other cases to make on behalf of markets and/or capitalism than the welfare theorems. (But: the jungle is fully decentralized; free from meddling bureaucrats or improving intellectuals; provides strong incentives for individual initiative, innovative thinking, and general self-improvement; and of no higher computational complexity than an Arrow-Debreu economy.) I should add that anyone who actually grasps the textbook accounts of welfare economics should certainly be able to follow this well-written paper. They will also benefit from the very sensible concluding section 5.2 about the rhetoric of economists.

*: The second fundamental theorem is that, roughly speaking, every Pareto optimum is a competitive equilibrium, attainable by a one-time transfer of goods. This does not hold for "the jungle" (section 4.2 of Piccione and Rubinstein).

**: Hobbes — this was originally a conversation about Hobbes — was, of course, highly influenced by Thucydides, going so far as to translate The Peloponnesian War. (His version of the line is "they that have odds of power exact as much as they can, and the weak yield to such conditions as they can get".)

The Dismal Science

Tweaking Resource-Allocation-by-Tweaking (Introduction to Statistical Computing)

Lab: Flow Control and the Urban Economy (Introduction to Statistical Computing)


Writing and Calling Functions (Introduction to Statistical Computing)


Lecture 4: Just as data structures tie related values together into objects, functions tie related commands together into objects. Declaring functions. Arguments (inputs) and return values (outputs). Named arguments, defaults, and calling functions. Interfaces: controlling what the function can see and do; first sketch of scoping rules. The importance of the interface. An example of writing and improving a function, for fitting the model from the last lab. R for examples.
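For a flavor of the syntax (a toy function of my own, not the lab's model-fitting example):

    # Rescale a vector to run from 0 to 1; the reference range has defaults,
    # so the function can be called with one argument or adjusted by name.
    rescale <- function(x, lower = min(x), upper = max(x)) {
      scaled <- (x - lower) / (upper - lower)  # sees only its arguments and local variables
      return(scaled)                           # explicit return; the last value would also be returned
    }

    y <- c(5, 7, 9)
    rescale(y)                         # defaults: uses the range of y itself
    rescale(y, lower = 0, upper = 10)  # named arguments override the defaults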

Introduction to Statistical Computing

Writing Multiple Functions (Introduction to Statistical Computing)


Lecture 5: Using multiple functions to solve multiple problems; to sub-divide awkward problems into more tractable ones; to re-use solutions to recurring problems. Value of consistent interfaces for functions working with the same object, or doing similar tasks. Examples: writing prediction and plotting functions for the model from the last lab. Advantages of splitting big problems into smaller ones with their own functions: understanding, modification, design, re-use of work. Trade-off between internal sub-functions and separate functions. Re-writing the plotting function to use the prediction function. Recursion. Example: re-writing the resource allocation code to be more modular and recursive. R for examples.
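A sketch of the design idea, with made-up functions standing in for the lab's prediction and plotting code:

    # Two functions with a consistent interface: both take named coefficients and x values
    predict_line <- function(coefs, x) coefs[["a"]] + coefs[["b"]] * x
    plot_line <- function(coefs, x, y) {
      plot(x, y)
      lines(x, predict_line(coefs, x))  # the plotting function re-uses the prediction function
    }

    # Recursion: a function calling itself on a smaller version of the problem
    factorial_rec <- function(n) {
      if (n <= 1) { return(1) }
      n * factorial_rec(n - 1)
    }

    coefs <- c(a = 1, b = 2)
    x <- 1:10; y <- coefs[["a"]] + coefs[["b"]] * x + rnorm(10)
    plot_line(coefs, x, y)
    factorial_rec(5)  # 120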

Introduction to Statistical Computing

Lab: Of Big- and Small- Hearted Cats (Introduction to Statistical Computing)

Hitting Bottom and Calling for a Shovel (Introduction to Statistical Computing)


In which we see how to minimize the mean squared error when there are two parameters, in the process learning about writing functions, decomposing problems into smaller steps, testing the solutions to the smaller steps, and minimization by gradient descent.
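As a hedged sketch of the technique (a minimal example, not the assignment's solution), gradient descent on a two-parameter mean squared error might look like this, with the gradient approximated by finite differences:

    # Mean squared error of the line y = theta[1] + theta[2] * x
    mse <- function(theta, x, y) mean((y - (theta[1] + theta[2] * x))^2)

    # Numerical gradient of the MSE by central differences
    mse_gradient <- function(theta, x, y, h = 1e-6) {
      sapply(seq_along(theta), function(j) {
        step <- rep(0, length(theta)); step[j] <- h
        (mse(theta + step, x, y) - mse(theta - step, x, y)) / (2 * h)
      })
    }

    # Take repeated small steps downhill until the objective stops improving
    gradient_descent <- function(theta, x, y, step_size = 0.01, max_iter = 1e4, tol = 1e-8) {
      for (i in 1:max_iter) {
        theta_new <- theta - step_size * mse_gradient(theta, x, y)
        converged <- abs(mse(theta_new, x, y) - mse(theta, x, y)) < tol
        theta <- theta_new
        if (converged) break
      }
      theta
    }

    x <- runif(50, 0, 5); y <- 1 + 2 * x + rnorm(50, sd = 0.5)
    gradient_descent(theta = c(0, 0), x = x, y = y)  # should land near (1, 2)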

Assignment

Introduction to Statistical Computing

"Balancing the Books By Benchmarking: What To Do When Small Area Estimates Just Don't Add Up" (Next Week at the Statistics Seminar)


Next week, we have one of our graduate student seminars, where the speakers are selected, and all the organizing work is done, by our graduate students:

Beka Steorts, "Balancing the Books By Benchmarking: What To Do When Small Area Estimates Just Don't Add Up"
Abstract: Small area estimation has become increasingly popular due to growing demand for such statistics. In order to produce estimates of adequate precision for these small areas, it is often necessary to borrow strength from other related areas. The resulting model-based estimates may not aggregate to the more reliable direct estimates at the higher level, which may be politically problematic. Adjusting model-based estimates to correct this problem is known as benchmarking.
We motivate small area estimation using a shrinkage argument from Efron and Morris (1975) where we are interested in estimating batting averages for baseball players from 1970. After this motivation, we propose a general class of benchmarked Bayes estimators that can be expressed in the form of a Bayesian adjustment applicable to any estimator, linear or nonlinear. We also derive a second set of estimators under an additional constraint that benchmarks the weighted variability. We illustrate this work using U.S. Census Bureau data. Finally, we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking under an empirical Bayes model, and we also find an asymptotically unbiased estimator of this MSE and compare it to the second-order approximation of the MSE of the EB estimator or, equivalently, of the MSE of the empirical best linear unbiased predictor (EBLUP), that was derived by Prasad and Rao (1990). Moreover, using methods similar to those of Butar and Lahiri (2003), we compute a parametric bootstrap estimator of the MSE of the benchmarked EB estimator and compare it to the MSE of the benchmarked EB estimator. Finally, we illustrate our methods using SAIPE data from the U.S. Census Bureau, and in a simulation study.

Enigmas of Chance
