We then reload the kNN.py module in the Python environment with
reload(kNN)
and call
kNN.datingClassTest()
which gives the following result:
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
...
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000
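So the classifier's error rate on the test data is 5%. The exact figure may differ somewhat from run to run, because the randomly selected data may not be the same.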
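The error rate printed at the end is simply the fraction of test samples whose predicted label differs from the real one. Here is a minimal sketch of that bookkeeping. The `error_rate` helper and the 20 predictions are made up for illustration; they are not the real output of kNN.datingClassTest().

```python
def error_rate(predictions, answers):
    """Return the fraction of predictions that differ from the real answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    errors = sum(1 for p, a in zip(predictions, answers) if p != a)
    return errors / float(len(answers))


# Hypothetical results for 20 test samples (not the real dating data).
predicted = [3, 2, 1, 1, 1, 3, 3, 2, 1, 3] * 2
actual = list(predicted)
actual[-1] = 1  # the classifier said 3, the real answer is 1

print("the total error rate is: %f" % error_rate(predicted, actual))
# prints: the total error rate is: 0.050000
```

One wrong answer out of 20 gives 0.05, the same kind of figure as in the output above.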
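- Using the algorithm
We now use the classifier built above to construct a usable system: given these feature values as input, it predicts for her how much she will like a person. Let's write the code: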
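def classifyPerson():
    # Class labels 1, 2, 3 map to these descriptions (index = label - 1)
    resultList = ['not at all', 'small doses', 'large doses']
    # Read the three feature values from the user (raw_input is Python 2)
    percentTats = float(raw_input("percent of time spent>"))
    miles = float(raw_input("flier miles per year?"))
    ice = float(raw_input("liters of ice-cream?"))
    # Load the training data and normalize it
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the columns of the data file
    inArr = array([miles, percentTats, ice])
    # Normalize the new sample with the training min/range, then classify with k = 3
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print "you will like this person: ", resultList[classifierResult - 1]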
This code should look familiar by now; the only new piece is raw_input, which reads the user's input. Let's look at the result:
>>> reload(kNN)
<module 'kNN' from 'kNN.py'>
>>> kNN.classifyPerson()
percent of time spent>10
flier miles per year?10000
liters of ice-cream?0.5
you will like this person: small doses
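One detail in classifyPerson is easy to miss: the new sample is scaled with the minimum values and ranges taken from the training data, `(inArr - minVals) / ranges`, so that it lives in the same 0 to 1 space as normMat. The sketch below shows the idea in plain Python. The helpers `auto_norm` and `normalize_sample` and the three training rows are hypothetical stand-ins, not the autoNorm function from kNN.py, and they assume every column has a nonzero range.

```python
def auto_norm(rows):
    """Min-max scale each column to [0, 1]; return scaled rows, ranges, minimums."""
    columns = list(zip(*rows))
    min_vals = [min(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]  # assumed nonzero
    scaled = [[(v - lo) / float(r) for v, lo, r in zip(row, min_vals, ranges)]
              for row in rows]
    return scaled, ranges, min_vals


def normalize_sample(sample, min_vals, ranges):
    """Scale one new sample with the training set's minimums and ranges."""
    return [(v - lo) / float(r) for v, lo, r in zip(sample, min_vals, ranges)]


# Made-up training rows: [flier miles, percent of time, liters of ice cream]
train = [
    [0.0, 0.0, 0.0],
    [20000.0, 10.0, 1.0],
    [40000.0, 20.0, 2.0],
]
scaled, ranges, min_vals = auto_norm(train)

new_person = [10000.0, 10.0, 0.5]
print(normalize_sample(new_person, min_vals, ranges))
# prints: [0.25, 0.5, 0.25]
```

Without this step the flier miles, which are in the thousands, would dominate the distance and the other two features would barely matter.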
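While working through the nearest-neighbor algorithm, we also noticed that there was no "train the algorithm" step. That is because kNN needs no training: it simply computes distances against the stored samples when asked for a prediction.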
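To make the "no training" point concrete, here is a small self-contained sketch of a kNN classifier in plain Python. It is not the classify0 function from kNN.py, only an illustration under the same idea (Euclidean distance plus majority vote); the function name `classify` and the four-point toy data set are assumptions for this example.

```python
import math
from collections import Counter


def classify(in_x, data_set, labels, k):
    """Label in_x by majority vote among its k nearest stored samples."""
    # There is no fitted model: every query measures its distance
    # to every stored sample.
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(in_x, row)))
                 for row in data_set]
    # Indices of the k samples with the smallest distance
    nearest = sorted(range(len(data_set)), key=lambda i: distances[i])[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two points near (1, 1) labeled 'A', two near (0, 0) labeled 'B'
group = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']

print(classify([0.0, 0.0], group, labels, 3))
# prints: B
```

"Training" here amounts to keeping `group` and `labels` in memory; all the work happens inside `classify` at prediction time.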
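We also noticed a problem: the k-nearest-neighbors algorithm is very slow, because it has to compute the distance to every stored vector for each prediction. How can this be optimized?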