Wednesday, June 30th, 2010
Techdirt has details of how some statistics experts determined that a company called R2K was faking its polls for the Daily Kos:
The first thing they noticed was that when R2K did polls that tested how men and women viewed certain politicians or political parties (favorable/unfavorable) there was an odd pattern: if the percentage of men that rated a particular politician favorable or unfavorable was an even number, so was the percentage of female raters. It seemed like these two points always matched up. If the male percentage was even the female percentage was even. If the male percentage was odd, the female percentage was odd. Yet, as you should know, these are independent variables, not influenced by each other. That 34% of men find a particular politician favorable should have no bearing on why an even percentage of women find that politician favorable. In fact, this happened in almost every such poll that R2K did, to such a level as to suggest it being as close to impossible as you can imagine.
I love stories of detecting information inside of information. I think that is why I liked the Lisbeth Salander books so much.
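To get a feel for why matching parity is so suspicious, here is a minimal Python sketch. It assumes that, in honest polling, the male and female percentages are independent and each is about equally likely to be even or odd, so any one pair matches with probability 0.5. The percentages below are made up for illustration and are not R2K's data, and the function names are my own.

```python
from math import comb

def count_parity_matches(pairs):
    """Count (men %, women %) pairs where both numbers are even or both are odd."""
    return sum(1 for men, women in pairs if men % 2 == women % 2)

def prob_at_least_k_matches(n, k, p=0.5):
    """Binomial tail: chance of k or more parity matches in n independent pairs.

    p=0.5 is an assumption: each pair matches by chance half the time.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical favorable ratings (men %, women %), invented for this example.
pairs = [(34, 42), (51, 47), (28, 30), (63, 59), (40, 44)]

matches = count_parity_matches(pairs)
chance = prob_at_least_k_matches(len(pairs), matches)
print(f"{matches} of {len(pairs)} pairs match; chance of that or more: {chance}")
```

With only five pairs the chance is already about 3%. It halves with every additional matching pair, which is how a long run of polls ends up "as close to impossible as you can imagine".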
Posted in General, statistics | No Comments »
Thursday, April 28th, 2005
How do you monitor a population for a disease outbreak? An article in PLoS Medicine (Public Library of Science), a peer-reviewed open access journal, goes into the math of a new method by a guy named Martin Kulldorff. I found this via Focus, news about Harvard Medical and Public Health research, which I found via someone who left a comment on this weblog. Here, I try to put into my own words what the paper said, in order to more fully understand what they are talking about. This is also an experiment in "using a weblog to make it seem like you know more about something than you actually do". Any corrections are welcome. LINK to article
- Most methods look at spikes in cases over time. As the paper says, this is good for seeing a spike across a geographic area, but not good for zooming in and noticing a local outbreak within the area being studied.
- If you try to monitor localities side by side to notice a spike over time in one of them, you run into the problem of "multiple testing". I think this boils down to the fact that the more places and time periods you test, the more likely it is that one of them shows a spike purely by chance.
- Something called the "Scan Statistic" can help with the problem of multiple testing. This sounds to me like moving a little scanning window along one axis, "time", and if you spot a signal, moving the window sideways across the "space" axis for that same point in time to see if the signal goes away or not.
- Until now, all scan statistics needed a control group, data about the general population so they can know if, for example, ten cases of some disease means something is happening, or if it is just to be expected given the population you are dealing with. Data about the population at large, or "the population at risk" as the article calls them, is often not available.
- If that background data is not available, what do you do? Compare this week’s disease rate with last week’s? According to the paper, this introduces a lot of what I would call feedback. Random noise in one week affects the outcome of the next week and so on.
- So, what is a maverick public health investigator to do? Meet the "space-time permutation scan statistic". This reads to me like they are doing something fancy with the scanning window described above. When I try to read how they are doing this, it all comes apart and resolves itself as a mechanical bird drinking out of a cup of water, but the gist is that it needs a way to come up with expected numbers, so it uses probability. It says, "let's go sideways over space to gather a baseline for a particular day and then go up and down over time to gather a baseline for a particular area, then boil those in a pot together and get our expected numbers." The math is described in the article.
- Adjust for multiple testing. Without the background population data, they had to come up with a way to account for multiple testing. As far as I can tell, they did this by shuffling the data they had and then analyzing it again to see whether the shuffled data produced a signal as strong as the real one. This is called Monte Carlo hypothesis testing.
- Sounds expensive. Open Source Software to the rescue! Municipalities can use the public domain SaTScan software (http://www.satscan.org/) to do this number crunching.
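To make the last few steps concrete for myself, here is a rough Python sketch of the idea as I understand it. It is not SaTScan's implementation. The expected count for an area on a day is (cases in that area over all days) × (cases on that day over all areas) ÷ (total cases), which is the "boil those in a pot together" step. The big simplification is that this sketch only looks at single area-day cells, whereas the real method scans windows covering many neighboring areas and several days. The case data, the function names, and the number of shuffles are all made up for illustration.

```python
import math
import random
from collections import Counter

def expected_counts(cases):
    """cases: list of (area, day) pairs, one per case.

    Expected count per cell = area total * day total / grand total.
    """
    total = len(cases)
    by_area = Counter(area for area, _ in cases)
    by_day = Counter(day for _, day in cases)
    return {(a, d): by_area[a] * by_day[d] / total
            for a in by_area for d in by_day}

def max_log_ratio(cases):
    """Largest log likelihood ratio over single cells with more cases than expected.

    Simplification: the real method scans space-time windows, not single cells.
    """
    total = len(cases)
    observed = Counter(cases)
    expected = expected_counts(cases)
    best = 0.0
    for cell, c in observed.items():
        mu = expected[cell]
        if c > mu:
            llr = c * math.log(c / mu)
            if c < total:
                llr += (total - c) * math.log((total - c) / (total - mu))
            best = max(best, llr)
    return best

def monte_carlo_p_value(cases, n_shuffles=999, seed=0):
    """Shuffle the days among the cases and see how often chance beats the real signal."""
    rng = random.Random(seed)
    real = max_log_ratio(cases)
    areas = [area for area, _ in cases]
    days = [day for _, day in cases]
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(days)  # keeps the per-area and per-day totals unchanged
        if max_log_ratio(list(zip(areas, days))) >= real:
            as_extreme += 1
    return (as_extreme + 1) / (n_shuffles + 1)

# Hypothetical data: 5 areas x 10 days with 2 cases each, plus 10 extra in area B on day 7.
cases = [(a, d) for a in "ABCDE" for d in range(10) for _ in range(2)]
cases += [("B", 7)] * 10

expected = expected_counts(cases)
p_value = monte_carlo_p_value(cases)
print("expected in (B, 7):", round(expected[("B", 7)], 2))
print("Monte Carlo p-value:", p_value)
```

Shuffling only the days leaves every area's total and every day's total untouched, so each shuffled data set has the same expected numbers as the real one. Comparing the real maximum against the maximum from each shuffle is what takes care of multiple testing, because the comparison is between "the best signal anywhere" in both cases.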
Posted in Computers, GIS, General, health, open source, science, statistics | 1 Comment »