ICPSR Summer Program Lecture Materials
Data Mining
These files are the outlines that I use as guides to each lecture.
Each is given in PDF format. The software used for the computing is
JMP from SAS (though many of the tools are also in R, Stata, SPSS, and
SAS). This syllabus summarizes the
course and lectures.
- Introduction and Exploratory Mining
This introductory lecture talks about the place for data
mining in the social sciences and the differences that
distinguish data mining from classical statistics. The
lecture also introduces JMP and the data set followed in later
examples, the ANES 2004 data from ICPSR. We'll start using
JMP by exploring this large data file in class, using JMP's
plot linking to explore voting behavior and the use of feeling
thermometers.
- Models for Prediction
- Data Mining with Regression
- Lab Session
- Using Regression More Effectively
- Streaming Features and Alpha Investing
- Alternatives: Neural Networks, Classification & Regression Trees
- Classification and Regression Trees
- Lab Session
- Wrapping-Up: Summary and Opportunities
Some data sets to play with...
Bootstrap Resampling
The lecture summaries shown below are copies of the transparencies shown
on the computer and discussed in class. You can also get the software
that accompanies these lectures below. If you cannot find something, take
a second look, and if it's still not there, send me an e-mail at
stine@wharton.upenn.edu.
Overview
This overview summarizes most of what is
covered at a more leisurely pace in the following lectures. The linked PDF
file gives the slides that I used in summarizing bootstrap methods in a
seminar at UNC in April, 2000. An extra
postscript file
has the double bootstrap figure used in this overview.
My paper "An Introduction to Bootstrap
Methods" (which appeared in Sociological Methods & Research
back in 1989) introduces you to the ideas of bootstrap resampling
through a variety of examples. The paper includes examples in
regression and illustrates situations in which the bootstrap does not
give the answer you'd like.
Lectures
The lecture notes are in PDF format, so you will need
Adobe Acrobat to view, search, and print the files.
- Syllabus
- The syllabus presents a brief overview of what happens in
each class, along with some review questions. This syllabus also
appears in the introductory program information given to you if
you attended the ICPSR Summer Program.
- Bibliography
- This annotated list of references is not comprehensive; rather, it is
representative of what is available on a wide variety of topics,
ranging from how to handle complex surveys to methods for time
series. Like the syllabus, it is also included in the information
distributed to program participants.
- Lecture Notes
- The file for each lecture is a printed version of the Word file
that I use for that class. I may clean up some errors if I find them,
but the files are pretty close to what was used in class. To use the
R scripts that accompany the lecture notes, you'll need to have
R installed on your own system.
- Introduction
( Lecture1.R )
I lost the data on the sample proportions when class
ended on Monday. Darn, and sorry.
- Exploring the Bootstrap
( Lecture2.R )
- The Bootstrap in Simple Regression
and Correlation
( Lecture3.R )
Here are the data sets that we used in this class:
Computing and more sophisticated estimators like robust regression
come up in this class. Fortunately, John Fox discusses both in the
on-line appendices to his book An R and S-Plus Companion to Applied
Regression. Look for the relevant Web appendices for the book.
- Multiple Regression
( Lecture4.R )
We used Duncan's data
on occupational prestige for some of these examples.
- More Methods, Flaws, and Intervals
( Lecture5.R )
Software
If we used files of commands with a lecture, those files appear
above with the lecture notes. Otherwise, look here.
- AXIS
- In addition to the "raw lisp" software used in class, I also used
the AXIS interface functions. You can get a zip file of the needed
programs and further information about AXIS and Lisp-Stat on my
main web page.
- Lisp-Stat
- The official source for Lisp-Stat is the
software archive prepared by Luke Tierney (author of Lisp-Stat). The
archive is available from the
University of Minnesota Statistics Department.
- R
- The
CRAN archives
have the source for R for various systems,
including Unix and Windows. The archives also offer quite a
few supplements and documentation.
Other Things
- References to Bootstrap Resampling
- In addition to the
bibliography mentioned
above, I list references to bootstrap resampling used in
social science applications. I don't see too many of these
journals, so any suggestions on your part are appreciated.
No one seems to want to do this, but I'll post them if you
do. Just send me a
mail message.
- Dalgleish, L. I. (1995). Discriminant analysis: statistical
inference using the jackknife and bootstrap procedures.
Psychological Bulletin, Vol. 116. (Shows some SAS
routines for testing the size of coefficients.)
- Follow this
link to see web pages describing recent work using
the bootstrap to assess goodness-of-fit measures.