When doing machine learning in Python, the function sklearn.model_selection.train_test_split from sklearn is commonly used to split a dataset into training samples and testing samples.
How to use train_test_split:
Syntax: sklearn.model_selection.train_test_split(*arrays, **options)
The commonly used arguments of train_test_split:
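- arrays: the objects to split. These can be lists, NumPy arrays, matrices, or pandas DataFrames, and they must all have the same length.
- test_size: can be given in two ways. 1: as a float between 0.0 and 1.0, in which case it is the proportion of the dataset placed in the test set. 2: as an integer, in which case it is the absolute number of test samples and must be smaller than the number of samples in the dataset. If test_size is not specified, it is set to the complement of train_size. If train_size is not specified either, test_size defaults to 0.25.
- train_size: works like test_size, but for the training set.
- random_state: the seed for the random shuffling applied before the split. Passing an integer makes the split reproducible across calls. If it is not specified, the global random state of numpy.random is used.
- shuffle and stratify are used less often. shuffle controls whether the data is shuffled before splitting (the default is True). stratify splits the data in a stratified fashion, using the array passed to it as class labels, and requires shuffle=True.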
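To make the arrays and random_state arguments more concrete, here is a minimal sketch. The toy arrays `X` and `y` are made up for illustration and are not part of the original examples. It passes two arrays at once and repeats the call with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, plus one label per sample.
# Row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Passing several arrays splits all of them along the same row indices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Repeating the call with the same random_state reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
print(np.array_equal(y_test, y_test2))           # same seed, same test labels
print(np.array_equal(X_test[:, 0], 2 * y_test))  # rows stay paired with their labels
```

The result is always returned in the order train, test for each input array, so two input arrays give four outputs.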
train_test_split usage examples:
>>> import pandas as pd
>>> from sklearn.model_selection import train_test_split
>>>
>>> namelist = pd.DataFrame({
...     "name": ["Suzuki", "Tanaka", "Yamada", "Watanabe", "Yamamoto",
...              "Okada", "Ueda", "Inoue", "Hayashi", "Sato",
...              "Hirayama", "Shimada"],
...     "age": [30, 40, 55, 29, 41, 28, 42, 24, 33, 39, 49, 53],
...     "department": ["HR", "Legal", "IT", "HR", "HR", "IT",
...                    "Legal", "Legal", "IT", "HR", "Legal", "Legal"],
...     "attendance": [1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1]
... })
>>> print(namelist)
    age  attendance  department      name
0    30           1          HR    Suzuki
1    40           1       Legal    Tanaka
2    55           1          IT    Yamada
3    29           0          HR  Watanabe
4    41           1          HR  Yamamoto
5    28           1          IT     Okada
6    42           1       Legal      Ueda
7    24           0       Legal     Inoue
8    33           0          IT   Hayashi
9    39           1          HR      Sato
10   49           1       Legal  Hirayama
11   53           1       Legal   Shimada
First, set the proportion of testing data to 0.3 (test_size=0.3) to split the data into training and testing sets.
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=0.3)
>>> print(namelist_train)
    age  attendance  department      name
10   49           1       Legal  Hirayama
1    40           1       Legal    Tanaka
7    24           0       Legal     Inoue
2    55           1          IT    Yamada
4    41           1          HR  Yamamoto
3    29           0          HR  Watanabe
9    39           1          HR      Sato
6    42           1       Legal      Ueda
>>> print(namelist_test)
    age  attendance  department     name
0    30           1          HR   Suzuki
8    33           0          IT  Hayashi
11   53           1       Legal  Shimada
5    28           1          IT    Okada
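The split above gives 4 test rows and 8 training rows, even though 0.3 × 12 = 3.6 is not a whole number. The sketch below reproduces that arithmetic. It assumes, as the observed sizes suggest, that a fractional test count is rounded up and that the training set takes the remaining rows. The stand-in list `rows` is hypothetical and only matches the table's length.

```python
import math
from sklearn.model_selection import train_test_split

n_samples = 12
test_size = 0.3

# Assumed rounding rule: the fractional test count is rounded up.
n_test = math.ceil(test_size * n_samples)  # 3.6 rounds up to 4
n_train = n_samples - n_test               # the remaining 8 rows

# Check against an actual split of a 12-element stand-in list.
rows = list(range(n_samples))
train_rows, test_rows = train_test_split(rows, test_size=test_size, random_state=0)

print(n_train, n_test)
print(len(train_rows), len(test_rows))
```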
Next, specify the testing data as an absolute number of samples: test_size=5.
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=5)
>>> print(namelist_train)
   age  attendance  department      name
3   29           0          HR  Watanabe
4   41           1          HR  Yamamoto
6   42           1       Legal      Ueda
1   40           1       Legal    Tanaka
9   39           1          HR      Sato
8   33           0          IT   Hayashi
7   24           0       Legal     Inoue
>>> print(namelist_test)
    age  attendance  department      name
2    55           1          IT    Yamada
10   49           1       Legal  Hirayama
5    28           1          IT     Okada
11   53           1       Legal   Shimada
0    30           1          HR    Suzuki
Next, set the proportion of training data to 0.5 (train_size=0.5).
>>> namelist_train, namelist_test = train_test_split(namelist, test_size=None, train_size=0.5)
>>> print(namelist_train)
    age  attendance  department      name
5    28           1          IT     Okada
2    55           1          IT    Yamada
3    29           0          HR  Watanabe
4    41           1          HR  Yamamoto
10   49           1       Legal  Hirayama
0    30           1          HR    Suzuki
>>> print(namelist_test)
    age  attendance  department     name
6    42           1       Legal     Ueda
7    24           0       Legal    Inoue
9    39           1          HR     Sato
11   53           1       Legal  Shimada
8    33           0          IT  Hayashi
1    40           1       Legal   Tanaka
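Next are the shuffle and stratify options. With shuffle=False, the rows are split in their original order, without any shuffling: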
>>> namelist_train, namelist_test = train_test_split(namelist, shuffle=False)
>>> print(namelist_train)
   age  attendance  department      name
0   30           1          HR    Suzuki
1   40           1       Legal    Tanaka
2   55           1          IT    Yamada
3   29           0          HR  Watanabe
4   41           1          HR  Yamamoto
5   28           1          IT     Okada
6   42           1       Legal      Ueda
7   24           0       Legal     Inoue
8   33           0          IT   Hayashi
>>> print(namelist_test)
    age  attendance  department      name
9    39           1          HR      Sato
10   49           1       Legal  Hirayama
11   53           1       Legal   Shimada
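The stratify option can be sketched in the same way. The example below is an illustration rather than output from the original session. It rebuilds a reduced copy of the table (name and department only) and passes the department column as the class labels, so that each department keeps roughly the same share in both subsets. The choice of random_state=0 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Reduced copy of the example table: 4 HR, 3 IT, and 5 Legal employees.
namelist = pd.DataFrame({
    "name": ["Suzuki", "Tanaka", "Yamada", "Watanabe", "Yamamoto",
             "Okada", "Ueda", "Inoue", "Hayashi", "Sato",
             "Hirayama", "Shimada"],
    "department": ["HR", "Legal", "IT", "HR", "HR", "IT",
                   "Legal", "Legal", "IT", "HR", "Legal", "Legal"],
})

# stratify takes the labels whose proportions should be preserved.
# test_size is left at its default of 0.25, i.e. 3 of the 12 rows.
namelist_train, namelist_test = train_test_split(
    namelist, stratify=namelist["department"], random_state=0
)

print(namelist_train["department"].value_counts())
print(namelist_test["department"].value_counts())
```

With only 3 test rows and 3 departments, the stratified test set should contain one row from each department, whereas an unstratified split could leave a department out entirely.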
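Summary
- train_test_split(*arrays, **options): arrays specifies the objects to split, i.e. the dataset.
- train_test_split(*arrays, **options): options specifies how to split, e.g. the proportions, randomness, and stratification.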