UCI Machine Learning Repository Content Summary
Abalone Database
- Donated by Sam Waugh
- Predicting the age of abalone from physical measurements
- Documentation: On everything
- 4177 instances, 8 attributes (one nominal)
- No missing attribute values
- Ftp
Access
Adult Database
- Donated by Ron Kohavi
- Predicting whether income exceeds $50K/yr based on census data
- Documentation: On everything
- 48842 instances, 14 attributes (6 continuous and 8 nominal)
- Missing attribute values
- Originally listed as the "Census Income" Database. It was renamed because
it is cited as the "Adult" database
- Ftp
Access
Annealing Database
- Documentation: On everything except database statistics
- Background information on this database: unknown
- Many missing attribute values
- Ftp
Access
Anonymous Microsoft Web Data Database
- Title: Log of anonymous users of the site www.microsoft.com
- Donated by: Jack S. Breese, David Heckerman, Carl M. Kadie
- Number of Instances: Training: 32711 Testing: 5000
- Each instance represents an anonymous, randomly selected user of the web
site.
- Number of Attributes: 294
- Ftp
Access
Arrhythmia Database
- Documentation: On everything
- The aim is to distinguish between the presence and absence of cardiac
arrhythmia and to classify it in one of the 16 groups.
- 16 classes
- 452 examples
- 279 attributes, 206 numeric
- Some missing attribute values
- Ftp
Access
Artificial Characters Database
- Artificially generated using a first order theory (which describes the
structure of ten capital letters) and a random choice theorem prover
- Domain Theory included
- Ftp
Access
Audiology Databases
- Original Version
- From Baylor College
- Documentation: On everything except database statistics
- Non-standardized attributes (differs between instances)
- All attributes are nominally-valued
- Standard Attribute Version of the original
- A standard set of attributes has been defined in terms of the original
properties according to a well-defined set of rules described in the
documentation files.
- 70 nominally-valued attributes
- Some missing attributes
- Ftp
Access
Auto-Mpg Database
- Revised from CMU StatLib library
- data concerns city-cycle fuel consumption
- Continuously valued class attribute (mpg)
- 398 instances, 5 numeric attributes
- Ftp
Access
Automobile Database
- From 1985 Ward's Automotive Yearbook
- Documentation: On everything except statistics and class distribution
- Good mix of numeric and nominal-valued attributes
- More than 1 attribute can be used as a class attribute in this database
- Ftp
Access
Badges Database
- Donated by Haym Hirsh
- 294 instances, 2 classes
- Instances are described using a sequence of characters (a name)
- Badge problem generated for attendees to figure out at MLC94
- Ftp
Access
Balance Scale Database
- Donated by Tim Hume
- 625 instances, 4 numeric attributes
- 3 classes (tip right, tip left, balanced)
- No missing values
- Ftp
Access
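The instance count above is consistent with enumerating every combination of four attributes that each take the values 1 to 5 (5^4 = 625). The following is a minimal sketch under that assumption, and under the further assumption that the class is decided by comparing weight times distance on each side; the attribute order and the labels "L", "B", "R" are illustrative choices, not taken from this summary.

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_dist, right_weight, right_dist):
    """Return 'L', 'R', or 'B' by comparing the torque on each side (assumed rule)."""
    left = left_weight * left_dist
    right = right_weight * right_dist
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"


# Enumerate every combination of four attributes with values 1..5 (assumed ranges).
counts = Counter(balance_class(*values) for values in product(range(1, 6), repeat=4))
total = sum(counts.values())
```

Comparing `total` with the 625 instances listed above is a quick sanity check on the assumed attribute ranges.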
Balloons Database
- Donated by Michael Pazzani
- Previously used in cognitive psychology experiment
- 16 instances, 2 classes, 4 attributes
- No missing values
- Ftp
Access
Breast Cancer Database
- From Ljubljana Oncology Institute
- Documentation: On everything except database statistics
- Well-used database
- 286 instances, 2 classes, 9 attributes + the class attribute
Wisconsin Breast Cancer Databases
- Original database
- Donated by Olvi Mangasarian
- Located in breast-cancer-wisconsin sub-directory, filenames root:
breast-cancer-wisconsin
- Currently contains 699 instances
- 2 classes (malignant and benign)
- 9 integer-valued attributes
- Ftp
Access
- New prognostic database
- Donated 1/96 by Nick Street
- Located in breast-cancer-wisconsin sub-directory, filenames' root: wpbc
- Two possible learning problems: predicting class (recurrent,
non-recurrent) or time to recur
- 33 numeric attributes
- Ftp
Access
- New diagnostic database
- Donated 1/96 by Nick Street
- Located in breast-cancer-wisconsin sub-directory, filenames' root: wdbc
- Classification learning problem: predicting class (malignant, benign)
- 30 numeric attributes
- Ftp
Access
Pittsburgh Bridges Database
- Donated by Yoram Reich
- Topic: design knowledge
- 108 instances, 13 attributes (7 specifications, 5 design description, and
1 identifier)
- 2 versions of the data: original and numeric-discretized
- Ftp
Access
Car Evaluation Database
- Donated by Marko Bohanec and Blaz Zupan (see also: Nursery Database)
- Car Evaluation Database was derived from a simple hierarchical decision
model originally developed for the demonstration of DEX (M. Bohanec, V.
Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157,
1990.)
- Because of known underlying concept structure, this database may be
particularly useful for testing constructive induction and structure discovery
methods.
- Classification (4 classes)
- Documentation: On everything
- 1728 instances, 6 nominal ordered attributes
- No missing attribute values
- Ftp
Access
Census Income Database
Chess Databases
- king-rook-vs-king-knight
- Documentation: limited (nothing on class distribution, statistics)
- This concerns king-knight versus king-rook end games
- The database creator is coded in Common Lisp
- king-rook-vs-king-pawn
- Documentation: sufficient
- This concerns king-rook versus king-pawn end games
- Originally described by Alen Shapiro
- king-rook-vs-king
- Donated by Michael Bain and Arthur van Hoff
- 28056 instances, 6 nominal features
- 17 classes to determine optimal depth-of-win
- Six Domain Theories
- Donated by Nick Flann
- In the "domain-theories" sub-directory
- Coded in a dialect of Prolog
- They all generate legal moves of chess
- I haven't yet touched Nick's documentation on them (See README)
- Ftp
Access
Bach Chorales (time-series) Database
- Donated by Darrell Conklin
- Single-line melodies of 100 Bach chorales (originally 4 voices)
- Number of Instances: 100 Chorales, each with ~45 events
- Number of Attributes: 6 (nominal) per event
- Ftp
Access
Connect-4 Opening Database
- Donated/Created by John Tromp
- Contains all legal 8-ply positions in the game of connect-4 in which
neither player has won yet, and in which the next move is not forced
- 67557 instances, 42 nominal attributes
- Ftp
Access
Credit Screening Databases
- Japanese Credit Screening Database
- Includes domain theory
- Positive instances are people who were granted credit
- The theory was generated by talking to Japanese domain experts
- Credit Card Application Approval Database
- Good mix of attributes -- continuous, nominal with small numbers of
values, and nominal with larger numbers of values
- 690 instances, 15 attributes some with missing values
- Ftp
Access
Computer Hardware Database
- From CACM 4/87
- Described in terms of its cycle time, memory size, etc.
- Classified in terms of their relative performance capabilities
- Documentation: complete
- Contains integer-valued concept labels
- All attributes are integer-valued
- Ftp
Access
Contraceptive Method Choice
- Origin: A subset of the 1987 National Indonesia Contraceptive Prevalence
Survey
- Donated by Tjen-Sien Lim (limt@stat.wisc.edu)
- 1473 instances, 2 classes, 10 attributes
- This dataset is a subset of the 1987 National Indonesia Contraceptive
Prevalence Survey. The samples are married women who were either not pregnant
or did not know if they were at the time of the interview. The problem is to
predict the current contraceptive method choice (no use, long-term methods, or
short-term methods) of a woman based on her demographic and socio-economic
characteristics.
- Ftp
Access
Covertype data
- Donated by Jock A. Blackard 8/28/98
- 581012 instances, 8 classes, 54 attributes
- Ftp
Access
Cylinder Bands Database
- Donated by Bob Evans 8/95
- Used in decision tree induction for mitigating process delays known as
"cylinder bands" in rotogravure printing
- 512 instances, 2 classes, 19 attributes
- Missing values
- Ftp
Access
Dermatology Database
- Documentation: On everything
- The aim is to determine the type of Erythemato-Squamous Disease.
- 6 classes
- 366 examples
- 34 attributes, 1 nominal
- Some missing attribute values
- Ftp
Access
Diabetes Data
- From AIM '94
- Non-Uniform Data format
- Time dependencies
- Ftp
Access
The Second Data Generation Program - DGP/2
- Generates instances around peaks and allows for specification of the mean
and standard deviations in the normally distributed data
- Generates application domains based on specific parameters: number of
features, and proportion of positive to negative examples
- Allows for variations in the number of instances, the range of feature
values, the number of peaks, the percent of positive instances desired and a
radius around the peaks that these instances fall within
- Ftp
Access
Document Understanding Database
- Donated by Donato Malerba
- Five concepts, expressed as predicates, to be learned
- Multiple predicate learning problem
- see .info file for more information
- Ftp
Access
EBL Domain Theories and Examples
- cup
- deductive.assumable (contains three domain theories)
- emotion
- ice
- pople
- safe-to-stack
- suicide
- Ftp
Access
Echocardiogram Database
- From Reed Institute, Miami
- Documentation: sufficient
- 13 numeric-valued attributes
- Binary classification: patient either alive or dead after survival period
- Ftp
Access
Ecoli Database
- Donated by Paul Horton (see also: yeast database)
- Predicting the Cellular Localization Sites of Proteins
- Documentation: On everything
- 336 instances, 8 attributes (one nominal)
- No missing attribute values
- Ftp
Access
Flags Database
- From Collins Gem Guide to Flags, 1986
- 194 instances, mixed numeric- and nominal-valued attributes
- donated by Richard S. Forsyth, creator of PC/BEAGLE
- Ftp
Access
Function Finding Databases
- Donated by Cullen Schaffer
- 352 Studies in Function-Finding
- Collected mostly from investigations in physical science
- Intention: Evaluation of function-finding algorithms
- Ftp
Access
Glass Identification Database
- From USA Forensic Science Service
- Documentation: completed
- 6 types of glass
- Defined in terms of their oxide content (e.g., Na, Fe, K)
- All attributes are numeric-valued
- Ftp
Access
Haberman's Survival Data
- Donor: Tjen-Sien Lim (limt@stat.wisc.edu)
- The dataset contains cases from a study that was conducted between 1958
and 1970 at the University of Chicago's Billings Hospital on the survival of
patients who had undergone surgery for breast cancer.
- Ftp
Access
Hayes-Roth Database
- Described in their 1977 paper
- Topic: human subjects study
- Ftp
Access
Heart Disease Databases
- Documentation: extensive
- 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
- 13 of the 75 attributes were used for prediction in 2 separate tests, each
of which achieved approximately 75%-80% classification accuracy
- The chosen 13 attributes are all continuously valued
- Includes cost data (donated by Peter Turney)
- Ftp
Access
Hepatitis Database
- From G.Gong: CMU
- Documentation: incomplete
- 155 instances with 20 attributes each; 2 classes
- Mostly Boolean or numeric-valued attribute types
- Includes cost data (donated by Peter Turney)
- Ftp
Access
Horse Colic Database
- From Mary McLeish & Matt Cecile
- Well documented attributes
- 368 instances with 28 attributes (continuous, discrete, and nominal)
- 30% missing values
- Ftp
Access
Housing Database (Boston)
- From CMU StatLib Library
- concerns housing prices in suburbs of Boston
- Continuously valued class attribute (MEDV)
- 506 instances, 12 continuous, 1 binary attributes
- Ftp
Access
ICU Data
- From Serdar Uckun (AIM '94)
- Deals with ICU treatment of patients with adult respiratory distress
syndrome (ARDS)
- Complex dataset (see documentation)
- Ftp
Access
Image segmentation Database
- Donated by Carla Brodley
- Documentation status: Skimpy
- Not previously used in the ML literature as of 8/1991
- Image data described by high-level numeric-valued attributes, 7 classes
- Ftp
Access
Internet Advertisements
- From Nicholas Kushmerick (nick@ucd.ie)
- This dataset represents a set of possible advertisements on Internet
pages. The features encode the geometry of the image (if available) as well as
phrases occurring in the URL, the image's URL and alt text, the anchor text,
and words occurring near the anchor text. The task is to predict whether an
image is an advertisement ("ad") or not ("nonad").
- Number of Instances: 3279 (2821 nonads, 458 ads)
- Number of Attributes: 1558 (3 continuous; others binary)
- Ftp
Access
Ionosphere Database
- From V. Sigillito
- Documentation Complete
- 2 classes, 351 instances, 34 numeric attributes, no missing values
- Classification of radar returns from the ionosphere
- Ftp
Access
Iris Plant Database
- From Fisher, 1936
- Documentation: complete
- 3 classes, 4 numeric attributes, 150 instances
- 1 class is linearly separable from the other 2, but the other 2 are not
linearly separable from each other (simple database)
- Ftp
Access
Isolet Spoken Letter Recognition Database
- From Ron Cole and Mark Fanty
- 6238 + 1559 instances, 26 classes (one for each letter)
- All attributes are real-valued scaled from -1.0 to 1.0.
- No missing values
- Ftp
Access
Kinship Database
- From Hinton 1986 & Quinlan 1989
- Relational
- 24 individuals, 12 relations
- 104 instances derivable
- Case studies have been reported by both authors
- Ftp
Access
Labor relations Database
- From Collective Bargaining Review
- Documentation: no statistics
- Please see the labor directory for more information
- Ftp
Access
LED Display Domains
- From Classification and Regression Trees book
- Documentation: sufficient, but missing statistical information
- All attributes are Boolean-valued
- Two versions: 7 and 24 attributes
- Optimal Bayes rate known for the 10% probability of noise problem
- Several ML researchers have used this domain for testing noise tolerance
- We provide here 2 C programs for generating sample databases
- Ftp
Access
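The repository's generators are C programs; the following is only a rough Python sketch of the idea behind the domain, not a port of them. It assumes the standard seven-segment digit patterns, that each segment is flipped independently with the stated 10% noise probability, and that the 24-attribute version pads the 7 segments with 17 irrelevant random Boolean attributes. The names `SEGMENTS` and `led_instance` are hypothetical.

```python
import random

# Assumed segment order: top, upper-left, upper-right, middle,
# lower-left, lower-right, bottom.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1),
    1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}


def led_instance(rng, noise=0.10, irrelevant=0):
    """Return (attributes, digit) for one randomly drawn, noisy LED display."""
    digit = rng.randrange(10)
    # Flip each segment independently with probability `noise`.
    segments = [bit ^ int(rng.random() < noise) for bit in SEGMENTS[digit]]
    # Optional irrelevant Boolean attributes (17 of them gives 24 in total).
    extras = [rng.randrange(2) for _ in range(irrelevant)]
    return segments + extras, digit


rng = random.Random(0)
sample_7 = [led_instance(rng) for _ in range(5)]
sample_24 = [led_instance(rng, irrelevant=17) for _ in range(5)]
```

Seeding `random.Random` keeps the generated sample reproducible, which matters when comparing noise tolerance across learners.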
Lenses Database
- Donated by Benoit Julien
- Small database with few attributes
- attributes are either binary- or ternary-valued
- 3 classes: hard contact lenses, soft contact lenses, or neither
- Ftp
Access
Letter Recognition Database
- From David Slate
- Based on various fonts
- 20,000 instances (712565 bytes) (.Z available)
- 17 attributes: 1 class (letter category) and 16 numeric (integer)
- No missing attribute values
- Ftp
Access
Liver-disorders Database
- BUPA Medical Research Ltd. database donated by Richard S. Forsyth
- 7 numeric-valued attributes
- 345 instances (male patients)
- Includes cost data (donated by Peter Turney)
- Ftp
Access
Logic-theorist
- Donated by Paul O'Rorke (described in Machine Learning)
- All code for LT
- Ftp
Access
Lung Cancer Database
- Donated by Stefan Aeberhard
- 32 instances, 57 Attributes (2 classes)
- No Attribute Definitions
- Ftp
Access
Lymphography Database
- From Ljubljana Oncology Institute
- Documentation: incomplete
- CITATION REQUIREMENT: Please use (see the documentation file)
- 148 instances; 19 attributes; 4 classes; no missing data values
Mechanical Analysis Data
- Donated by members of the Universita di Torino
- Fault diagnosis problem of electromechanical devices
- ENIGMA system application described in proceedings of MLC-1990
- Each of the 209 instances is described by a different set of components
- PUMPS DATA SET
- Newer version of above dataset with domain theory and results
- Ftp
Access
Meta-data Database
- Donated by J.Gama
- Meta-Data was used in order to give advice about which classification
method is appropriate for a particular dataset (taken from the results of the
Statlog project).
- 528 instances; 22 attributes; numeric prediction; missing values
- Ftp
Access
Mobile Robots Database
- Donated by Volker Klingspor, Katharina J. Morik and Anke D. Rieger
- Learning Concepts from Sensor Data of a Mobile Robot
- Multiple levels of learning (from raw sensor data to high level concepts)
- Ftp
Access
Molecular Biology Databases
- Promoter Gene Sequences Database
- Donated by Jude Shavlik; See AAAI-90 Towell, Shavlik, & Noordewier
- E. Coli promoter gene sequences (DNA) with partial domain theory
- 106 instances, each predictor attribute takes on one of four values
- 50% positive instances
- Splice-junction Gene Sequences Database
- Donated by Geoffrey Towell, Noordewier, & Shavlik
- categories "ei" and "ie" include every "split-gene" for primates in
Genbank 64.1
- non-splice examples taken from sequences known not to include a splicing
site
- 3190 instances with classes "ei" (25%), "ie" (25%) and Neither (50%)
- Domain theory included
- Protein Secondary Structure Database
- Originally created and used by Qian and Sejnowski
- From CMU connectionist bench repository
- Classifies secondary structure of certain globular proteins
- 3 classes: alpha-helix, beta-sheet and random-coil
- Protein Secondary Structure Domain Theory
- Donated and created by Jude Shavlik & Rich Maclin
- Imperfect domain theory for Qian and Sejnowski Protein Secondary
Structure database (above)
- Closely implements the algorithm of Chou and Fasman
- Ftp
Access
MONK's Problems
- Donated by Sebastian Thrun
- A set of three artificial domains over the same attribute space
- 6 nominally valued attributes, no missing values
- 1 problem has class noise added
- Used to test a wide range of induction algorithms
- Ftp
Access
Moral Reasoner Database
- Donated by James Wogulis
- Horn-clause model that qualitatively simulates moral reasoning
- 202 instances and theory
- Theory includes negated literals
- Ftp
Access
Multiple Features Database
- From Robert P.W. Duin
- This dataset consists of features of handwritten numerals ('0'-'9')
extracted from a collection of Dutch utility maps.
- 200 patterns per class (for a total of 2,000 patterns) have been digitized
in binary images.
- Digits are represented in terms of Fourier coefficients, profile
correlations, Karhunen-Loève coefficients, pixel averages, Zernike moments and
morphological features.
- Number of Instances: 2000 (200 per class)
- Number of Attributes: 649
- Number of Classes: 10
- Ftp
Access
Mushrooms Database
- From Audubon Society Field Guide
- Documentation: complete, but missing statistical information
- Described in terms of physical characteristics
- Classification: poisonous or edible
- All attributes are nominal-valued
- Large database: 8124 instances (2480 missing values for attribute #12)
- Ftp
Access
MUSK Databases
- Donated by Tom Dietterich
- Task: classify whether a molecule is a musk
- Two datasets: 476 and 6,598 instances, 168 attributes
- Was used to explore "multiple instance problem"
- Ftp
Access
Nursery Database
- Donated by Marko Bohanec and Blaz Zupan (see also: Car Evaluation
Database)
- Nursery Database was derived from a hierarchical decision model originally
developed to rank applications for nursery schools.
- Classification (5 classes)
- Because of known underlying concept structure, this database may be
particularly useful for testing constructive induction and structure discovery
methods.
- Documentation: On everything
- 12960 instances, 8 nominal attributes
- No missing attribute values
- Ftp
Access
Othello Domain Theory
- Written and donated by Tom Fawcett
- Coded in Prolog
- Used in research to generate features for an inductive learning system
- Ftp
Access
Page Blocks Classification Database
- Written and donated by Donato Malerba
- The problem consists of classifying all the blocks of the page layout of a
document that has been detected by a segmentation process. This is an
essential step in document analysis.
- 5473 examples come from 54 distinct documents
- All attributes are numeric
- Ftp
Access
Pima Indians Diabetes Database
- From National Institute of Diabetes and Digestive and Kidney Diseases
- Binary classes (tested positive or negative for diabetes)
- All 8 attributes are numeric-valued
- 768 instances
- Includes cost data (donated by Peter Turney)
- Ftp
Access
Optical Recognition of Handwritten Digits
- From E. Alpaydin, C. Kaynak
- 10 classes
- 3823 training, 1797 test cases
- 64 attributes (All input attributes are integers 0..16)
- Ftp
Access
Pen-Based Recognition of Handwritten Digits
- From E. Alpaydin, Fevzi Alimoglu
- 10 classes
- 7494 training cases, 3498 test cases
- 16 attributes (All input attributes are integers 0..100)
- Ftp
Access
Postoperative Patient Database
- From Jerzy W. Grzymala-Busse
- 3 classes
- 90 instances
- 8 attributes, one numeric with missing values
- Ftp
Access
Primary Tumor Database
- From Ljubljana Oncology Institute
- Documentation: incomplete
- CITATION REQUIREMENT: Please use (see the documentation file)
- 339 instances; 18 attributes; 22 classes; lots of missing data values
Qualitative Structure Activity Relationships (QSARs)
- Donated by Ross King
- Two datasets are given: pyrimidines and triazines
- 3 representations: ILP, Propositional Machine Learning Discrimination, and
Propositional Machine Learning Regression
- Ftp
Access
Quadruped Animals Data Generator
- Donated by John H. Gennari
- Structured data; each instance has 9 components, with 9 numeric-valued
attributes per component
- 4 classes
- Previously used to evaluate unsupervised learning algorithms
- Ftp
Access
Servo Database
- Donated by Ross Quinlan
- numerically valued class attribute
- 4 nominal attributes; 167 instances
- covers an extremely non-linear phenomenon
- Ftp
Access
Shuttle Landing Control Database
- Tiny, 15-instance database with 7 attributes per instance; 2 classes
- Instances have don't care values for some features (database may be
expanded to 277 instances)
- Ftp
Access
Solar Flare Databases
- From Gary Bradshaw
- 1389 instances, 13 attributes (includes 3 class attributes)
- Each class attribute counts the number of solar flares of a certain class
that occur in a 24 hour period
- Prediction attributes are nominal; no missing values
- Ftp
Access
Soybean Databases
- Donated by Michalski
- Documentation: Only the statistics are missing
- (2 sizes)
- Michalski's famous soybean disease databases
- Ftp
Access
Challenger USA Space Shuttle O-Ring Databases
- Donated by David Draper
- 2 small 23-instance databases containing only positive integers
- Fascinating topic: Analysis of launch temperature vs. O-ring stress
- Task: predict the number of O-rings that experience thermal distress on a
flight at 31 degrees F given data on the previous 23 shuttle flights
- Ftp
Access
Low Resolution Spectrometer Database
- From IRAS data -- NASA Ames Research Center
- Documentation: no statistics nor class distribution given
- LARGE database...and this is only 531 of the instances
- 98 attributes per instance (all numeric)
- Contact NASA-Ames Research Center for more information
- Ftp
Access
Spambase Database
- Donated by George Forman (gforman at nospam hpl.hp.com) 650-857-7835 Mark
Hopkins, Erik Reeber and Jaap Suermondt.
- Number of Instances: 4601 (1813 Spam = 39.4%)
- Number of Attributes: 58 (57 continuous, 1 nominal class label)
- The "spam" concept is diverse: advertisements for products/web sites, make
money fast schemes, chain letters, pornography... Our collection of spam
e-mails came from our postmaster and individuals who had filed spam. Our
collection of non-spam e-mails came from filed work and personal e-mails, and
hence the word 'george' and the area code '650' are indicators of non-spam.
These are useful when constructing a personalized spam filter. One would
either have to blind such non-spam indicators or get a very wide collection of
non-spam to generate a general purpose spam filter.
- Ftp
Access
SPECT and SPECTF heart databases
- Donated by Krzysztof J. Cios & Lukasz A. Kurgan (Krys.Cios@cudenver.edu)
- Documentation: Describes diagnosing of cardiac Single Photon Emission
Computed Tomography (SPECT) images. Each of the patients is classified into
two categories: normal and abnormal.
- 267 image sets (patients) in each dataset
- 23 attributes per instance (22 binary, 1 binary class) in SPECT
- 44 attributes per instance (43 binary, 1 binary class) in SPECTF
- Ftp
Access
Sponge Database
- Donated by Javier Bejar and Ulises Cortes
- Classification of Atlantic-Mediterranean marine sponges
- 76 instances
- 45 nominal and numeric attributes (some missing values)
- Ftp
Access
Statlog Project Databases
- Donated by Ross King
- Vehicle Silhouettes: 3D objects within a 2D image by application of an
ensemble of shape feature extractors to the 2D silhouettes of the objects.
- Landsat Satellite: multi-spectral values of pixels in 3x3 neighbourhoods
in a satellite image, and the classification associated with the central pixel
in each neighbourhood
- Shuttle: The shuttle dataset contains 9 attributes all of which are
numerical. Approximately 80% of the data belongs to class 1
- Australian Credit Approval: This file concerns credit card applications.
This database exists elsewhere in the repository (Credit Screening Database)
in a slightly different form
- Heart Disease: This dataset is a heart disease database similar to a
database already present in the repository (Heart Disease databases) but in a
slightly different form
- Image Segmentation: This dataset is an image segmentation database similar
to a database already present in the repository (Image segmentation database)
but in a slightly different form.
- German Credit Database: This dataset classifies people described by a set
of attributes as good or bad credit risks. Comes in two formats (one all
numeric). Also comes with a cost matrix
- Ftp
Access
Student Loan Relational Database
- Donated by Michael Pazzani
- Target concept: no_payment_due by person for student loan
- 1000 instances of target concept
- Includes domain theory
- 10+ extensionally and intensionally defined relations
- Ftp
Access
Teaching Assistant Evaluation
- Collected by Wei-Yin Loh (Department of Statistics, UW-Madison)
- Donated by Tjen-Sien Lim (limt@stat.wisc.edu)
- 151 instances, 6 attributes, 3 classes
- The data consist of evaluations of teaching performance over three regular
semesters and two summer semesters of 151 teaching assistant (TA) assignments
at the Statistics Department of the University of Wisconsin-Madison. The
scores were divided into 3 roughly equal-sized categories ("low", "medium",
and "high") to form the class variable.
- Ftp
Access
Tic-Tac-Toe Endgame Database
- Donated by David W. Aha, Turing Institute
- Documentation complete as of Summer 1991
- 958 instances, all attributes can take on 1 of 3 possible values
- Binary classification task (i.e., "win for x")
- A paradigmatic domain for constructive induction studies
- Ftp
Access
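The instance count above can be cross-checked by enumerating every distinct board at the end of a tic-tac-toe game. This is a minimal sketch that assumes x moves first and that play stops as soon as either side completes a line or the board fills up; the cell encoding ("x", "o", "b" for blank) and the function names are illustrative assumptions.

```python
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'x' or 'o' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "b" and board[a] == board[b] == board[c]:
            return board[a]
    return None


def terminal_boards():
    """Collect every distinct end-of-game board reachable when x moves first."""
    finals, visited = set(), set()
    stack = [(("b",) * 9, "x")]
    while stack:
        board, player = stack.pop()
        if board in visited:
            continue
        visited.add(board)
        if winner(board) or "b" not in board:
            finals.add(board)  # game over: a win or a full board
            continue
        next_player = "o" if player == "x" else "x"
        for i in range(9):
            if board[i] == "b":
                stack.append((board[:i] + (player,) + board[i + 1:], next_player))
    return finals


boards = terminal_boards()
x_wins = sum(1 for board in boards if winner(board) == "x")
```

Under these assumptions `len(boards)` comes out to 958, matching the instance count listed for this database, and `x_wins` gives the number of boards that would be labelled "win for x".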
Thyroid Disease Database
- From Garavan Institute
- Documentation: as given by Ross Quinlan
- 6 databases from the Garavan Institute in Sydney, Australia
- Approximately the following for each database:
- 2800 training (data) instances and 972 test instances
- Plenty of missing data
- 29 or so attributes, either Boolean or continuously-valued
- 2 additional databases, also from Ross Quinlan, are also here
- Hypothyroid.data and sick-euthyroid.data
- Quinlan believes that these databases have been corrupted
- Their format is highly similar to the other databases
- 1 more database of 9172 instances that cover 20 classes, and a related
domain theory
- Another thyroid database from Stefan Aeberhard
- 3 classes, 215 instances, 5 attributes
- No missing values
- A Thyroid database suited for training ANNs
- 3 classes
- 3772 training instances, 3428 testing instances
- Includes cost data (donated by Peter Turney)
- Ftp
Access
Trains Database
- Donated by David Aha & Eric Bloedorn
- Original owners: R. Michalski & R. Stepp
- 10 instances
- 10 attributes + class (direction: east or west)
- 2 data formats (structured, one-instance-per-line)
- Includes "East-West" competition data and results (donated by Peter Turney)
- Ftp
Access
University Database
- Donated by Steve Souders
- Documentation: scant; we've left it in its original (LISP-readable) form
- 285 instances, including some duplicates
- At least one attribute, academic-emphasis, can have multiple values per
instance
- The user is encouraged to pursue the Lebowitz reference for more
information on the database
- Ftp
Access
Congressional Voting Records Database
- 1984 United States Congressional Voting Records
- Classification: Republican or Democrat
- Documentation: completed
- All attributes are Boolean valued; plenty of missing values; 2 classes
- Ftp
Access
Water Treatment Plant Database
- Donated by Javier Bejar and Ulises Cortes
- 38 numeric attributes; 527 instances; missing values
- Multiple classes predict plant state
- Ill-Structured Domain
- Ftp
Access
Waveform Data Generator
- From Classification and Regression Trees book
- Documentation: no statistics
- CART book's waveform domains
- 21 and 40 continuous attributes respectively
- difficult concepts to learn, but known Bayes optimal classification rate
of 86% accuracy
- Ftp
Access
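For orientation, the waveform domain can be sketched as follows. This is a hedged reconstruction from the usual description of the CART waveform problem, not the repository's generator: it assumes three triangular base waves over positions 1 to 21, that each class mixes two of them with a uniform random weight, that unit-variance Gaussian noise is added to every attribute, and that the 40-attribute version appends 19 pure-noise attributes. All names here are hypothetical.

```python
import random


def base_wave(i, center):
    """Triangular wave of height 6 centred at `center` (assumed shape)."""
    return max(6 - abs(i - center), 0)


# Assumed pairing of base-wave centres for the three classes.
CLASS_CENTERS = {0: (11, 15), 1: (11, 7), 2: (15, 7)}


def waveform_instance(rng, extra_noise=0):
    """Return (attributes, label); extra_noise=19 gives the 40-attribute variant."""
    label = rng.randrange(3)
    center_a, center_b = CLASS_CENTERS[label]
    u = rng.random()  # random mixing weight between the two base waves
    attrs = [
        u * base_wave(i, center_a) + (1 - u) * base_wave(i, center_b) + rng.gauss(0, 1)
        for i in range(1, 22)
    ]
    attrs += [rng.gauss(0, 1) for _ in range(extra_noise)]
    return attrs, label


rng = random.Random(0)
small, small_label = waveform_instance(rng)
large, large_label = waveform_instance(rng, extra_noise=19)
```

Because every attribute carries independent noise, the classes overlap, which is why even the Bayes optimal rate quoted above stays well below 100%.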
Wine Recognition Database
- Donated by Stefan Aeberhard
- Using chemical analysis determine the origin of wines
- 13 attributes (all continuous), 3 classes, no missing values
- 178 instances
- Ftp
Access
Yeast Database
- Donated by Paul Horton (see also: Ecoli database)
- Predicting the Cellular Localization Sites of Proteins
- Documentation: On everything
- 1484 instances, 8 attributes (one nominal)
- No missing attribute values
- Ftp
Access
Zoo Database
- From Richard Forsyth
- Artificial
- 7 classes of animals
- 17 attributes (besides name), 15 Boolean and 2 numeric-valued
- No missing attribute values
- Ftp
Access
Undocumented Databases
- Mike Pazzani's economic sanctions database
- Philippe Collard's database on cloud cover images
- Vince Sigillito's database on DNA secondary structure
- Nettalk data (see connectionist-bench)
- Sonar data (see connectionist-bench)
- Vowel data (see connectionist-bench)
- Ftp
Access