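- Dataset downloads
- Getting datasets from the sklearn library
Many of the datasets used when learning machine learning are hosted on the UCI Machine Learning Repository; these notes are here so I can find them again easily.
UCI website (old version): https://archive.ics.uci.edu/ml/index.php
UCI website (new version): https://archive-beta.ics.uci.edu/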
Dataset downloads
The download links below all point to the old UCI site.
Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris
Wine dataset: https://archive.ics.uci.edu/ml/datasets/Wine
Boston housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
Lenses (contact lens) dataset: https://archive.ics.uci.edu/ml/datasets/lenses
Horse Colic dataset: http://archive.ics.uci.edu/ml/datasets/Horse+Colic
Bank Marketing dataset (a Portuguese banking institution): http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
1984 US Congressional Voting Records dataset: http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records
Mushroom dataset (finding features shared by poisonous mushrooms): https://archive.ics.uci.edu/ml/datasets/mushroom
A few others are Kaggle datasets (they cannot be downloaded without logging in):
San Francisco crime classification: https://www.kaggle.com/c/sf-crime
Titanic survival prediction: https://www.kaggle.com/c/titanic/data
Handwritten digit recognition: https://www.kaggle.com/c/digit-recognizer/data
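Getting datasets from the sklearn library
Later on I found that some datasets are bundled with sklearn and can be loaded with a single function call, which saves a lot of effort. There seem to be only about 12 of them, though. The returned object is dictionary-like (a sklearn Bunch, which looks much like JSON when printed); the code below uses the wine dataset as the example.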
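To see which loaders an installed sklearn actually offers, the public names of `sklearn.datasets` can be listed directly. This is only a rough sketch: it filters by name prefix, so the `load_` list also picks up file readers such as `load_files` and `load_svmlight_file`, which are not bundled datasets, and the exact lists depend on the installed version.

```python
import sklearn.datasets as datasets

# Everything whose name starts with "load_": mostly small datasets shipped
# with the package, plus a few file-reading helpers
loaders = sorted(name for name in dir(datasets) if name.startswith("load_"))

# Everything whose name starts with "fetch_": larger datasets that are
# downloaded and cached the first time they are requested
fetchers = sorted(name for name in dir(datasets) if name.startswith("fetch_"))

print(len(loaders), loaders)
print(len(fetchers), fetchers)
```

Counting by hand from this output is a quick way to check the "about 12" estimate against whatever version is installed.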
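- wine: the dictionary-like object returned by load_wine()
- wine.data: the feature values
- wine.feature_names: the name of each feature column
- wine.target: the class label of each sample
- wine.target_names: the names of the classes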
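The object returned by `load_wine()` is a scikit-learn `Bunch`, which is a `dict` subclass, so each field listed above can be read either as an attribute or as a key. A minimal sketch of inspecting those fields:

```python
from sklearn.datasets import load_wine

wine = load_wine()

# A Bunch is a dict subclass, so key access and attribute access are equivalent
print(isinstance(wine, dict))
print(wine["data"] is wine.data)

print(wine.data.shape)            # (number of samples, number of features)
print(len(wine.feature_names))    # one name per feature column
print(wine.target[:5])            # integer class label of the first five samples
print(list(wine.target_names))    # human-readable class names
```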
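If wine.data and wine.target are concatenated into a DataFrame,
the result is [178 rows x 14 columns]: the first 13 columns (indices 0~12) are features and the 14th column (index 13) is the label. wine.feature_names + ['class'] can serve as the column names.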
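The concatenation described above can be sketched with pandas as follows. The label column name `class` is an arbitrary choice made for this example, not something sklearn defines.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# 13 feature columns, named after wine.feature_names
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Append the label as the 14th column ("class" is a name chosen for this sketch)
df["class"] = wine.target

print(df.shape)
print(df.head())
```

In scikit-learn 0.23 and later, `load_wine(as_frame=True)` builds a similar DataFrame directly and exposes it as `.frame`, with the label column named `target`.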
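from sklearn.datasets import load_wine, load_iris, load_breast_cancer
import pprint

# load_boston was removed in scikit-learn 1.2, so it is not imported here
wine = load_wine()
iris = load_iris()
breast_cancer = load_breast_cancer()

# Pretty-print the whole wine object: DESCR, data, feature_names, frame, target, target_names
pprint.pprint(wine)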
'''
Printed output:
{'DESCR': '.. _wine_dataset:\n'
'\n'
'Wine recognition dataset\n'
'------------------------\n'
'\n'
'**Data Set Characteristics:**\n'
'\n'
' :Number of Instances: 178 (50 in each of three classes)\n'
' :Number of Attributes: 13 numeric, predictive attributes and '
'the class\n'
' :Attribute Information:\n'
' \t\t- Alcohol\n'
' \t\t- Malic acid\n'
' \t\t- Ash\n'
'\t\t- Alcalinity of ash \n'
' \t\t- Magnesium\n'
'\t\t- Total phenols\n'
' \t\t- Flavanoids\n'
' \t\t- Nonflavanoid phenols\n'
' \t\t- Proanthocyanins\n'
'\t\t- Color intensity\n'
' \t\t- Hue\n'
' \t\t- OD280/OD315 of diluted wines\n'
' \t\t- Proline\n'
'\n'
' - class:\n'
' - class_0\n'
' - class_1\n'
' - class_2\n'
'\t\t\n'
' :Summary Statistics:\n'
' \n'
' ============================= ==== ===== ======= =====\n'
' Min Max Mean SD\n'
' ============================= ==== ===== ======= =====\n'
' Alcohol: 11.0 14.8 13.0 0.8\n'
' Malic Acid: 0.74 5.80 2.34 1.12\n'
' Ash: 1.36 3.23 2.36 0.27\n'
' Alcalinity of Ash: 10.6 30.0 19.5 3.3\n'
' Magnesium: 70.0 162.0 99.7 14.3\n'
' Total Phenols: 0.98 3.88 2.29 0.63\n'
' Flavanoids: 0.34 5.08 2.03 1.00\n'
' Nonflavanoid Phenols: 0.13 0.66 0.36 0.12\n'
' Proanthocyanins: 0.41 3.58 1.59 0.57\n'
' Colour Intensity: 1.3 13.0 5.1 2.3\n'
' Hue: 0.48 1.71 0.96 0.23\n'
' OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71\n'
' Proline: 278 1680 746 315\n'
' ============================= ==== ===== ======= =====\n'
'\n'
' :Missing Attribute Values: None\n'
' :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n'
' :Creator: R.A. Fisher\n'
' :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n'
' :Date: July, 1988\n'
'\n'
'This is a copy of UCI ML Wine recognition datasets.\n'
'https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n'
'\n'
'The data is the results of a chemical analysis of wines grown in '
'the same\n'
'region in Italy by three different cultivators. There are thirteen '
'different\n'
'measurements taken for different constituents found in the three '
'types of\n'
'wine.\n'
'\n'
'Original Owners: \n'
'\n'
'Forina, M. et al, PARVUS - \n'
'An Extendible Package for Data Exploration, Classification and '
'Correlation. \n'
'Institute of Pharmaceutical and Food Analysis and Technologies,\n'
'Via Brigata Salerno, 16147 Genoa, Italy.\n'
'\n'
'Citation:\n'
'\n'
'Lichman, M. (2013). UCI Machine Learning Repository\n'
'[https://archive.ics.uci.edu/ml]. Irvine, CA: University of '
'California,\n'
'School of Information and Computer Science. \n'
'\n'
'.. topic:: References\n'
'\n'
' (1) S. Aeberhard, D. Coomans and O. de Vel, \n'
' Comparison of Classifiers in High Dimensional Settings, \n'
' Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. '
'of \n'
' Mathematics and Statistics, James Cook University of North '
'Queensland. \n'
' (Also submitted to Technometrics). \n'
'\n'
' The data was used with many others for comparing various \n'
' classifiers. The classes are separable, though only RDA \n'
' has achieved 100% correct classification. \n'
' (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed '
'data)) \n'
' (All results using the leave-one-out technique) \n'
'\n'
' (2) S. Aeberhard, D. Coomans and O. de Vel, \n'
' "THE CLASSIFICATION PERFORMANCE OF RDA" \n'
' Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. '
'of \n'
' Mathematics and Statistics, James Cook University of North '
'Queensland. \n'
' (Also submitted to Journal of Chemometrics).\n',
'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
1.065e+03],
[1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
1.050e+03],
[1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
1.185e+03],
...,
[1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
8.350e+02],
[1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
8.400e+02],
[1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
5.600e+02]]),
'feature_names': ['alcohol',
'malic_acid',
'ash',
'alcalinity_of_ash',
'magnesium',
'total_phenols',
'flavanoids',
'nonflavanoid_phenols',
'proanthocyanins',
'color_intensity',
'hue',
'od280/od315_of_diluted_wines',
'proline'],
'frame': None,
'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2]),
'target_names': array(['class_0', 'class_1', 'class_2'], dtype='


