Introduction to Machine Learning: the k-Nearest Neighbors Algorithm

        if classifierResult != datingLabels[i]:
            errorCount += 1.0  # record each misclassification
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))
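This snippet is only the tail of the test harness. For context, here is a sketch of what the full datingClassTest function might look like, assuming the file2matrix, autoNorm, and classify0 helpers defined earlier in kNN.py (the 10% hold-out ratio is an assumption):

def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the rows as test vectors (assumed)
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify test vector i against the remaining 90% of the data
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print "the classifier came back with: %d, the real answer is: %d" \
              % (classifierResult, datingLabels[i])
        if classifierResult != datingLabels[i]:
            errorCount += 1.0  # record each misclassification
    print "the total error rate is: %f" % (errorCount / float(numTestVecs))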

We then return to the Python shell and run

reload(kNN)

to reload the kNN.py module, and then call

kNN.datingClassTest()

which gives the following output:

the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
...
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000

So we see that the error rate on this data set is 5%. This figure will vary somewhat from run to run, because the randomly selected test data can differ.
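To get a more stable estimate, one option is to average the error rate over several random splits. The helper below is my own addition (the name averagedErrorRate and the use of numpy.random.permutation are assumptions; it reuses classify0 and the normalized data from above):

from numpy import random

def averagedErrorRate(normMat, labels, trials=10, hoRatio=0.10, k=3):
    m = normMat.shape[0]
    numTest = int(m * hoRatio)
    total = 0.0
    for _ in range(trials):
        perm = random.permutation(m)  # a fresh random split each trial
        shuffled = normMat[perm]
        shuffledLabels = [labels[i] for i in perm]
        errors = sum(1 for i in range(numTest)
                     if classify0(shuffled[i, :], shuffled[numTest:m, :],
                                  shuffledLabels[numTest:m], k) != shuffledLabels[i])
        total += errors / float(numTest)
    return total / trials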

  • Using the algorithm

Now we use the classifier built above to put together a small working system: given the three feature values as input, it predicts how much she will like the person. Let's write the code:

def classifyPerson():
    resultList = ['not', 'small doses', 'large doses']
    percentTats = float(raw_input("percent of time spent>"))
    miles = float(raw_input("flier miles per year?"))
    ice = float(raw_input("liters of ice-cream?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([miles, percentTats, ice])
    # normalize the input vector the same way as the training data
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print "you will like this person: ", resultList[classifierResult - 1]

Most of this code should be familiar by now; the only addition is raw_input, used to read the feature values from the user. Let's look at the result:

>>> reload(kNN)
<module 'kNN' from 'kNN.py'>
>>> kNN.classifyPerson()
percent of time spent>10
flier miles per year?10000
liters of ice-cream?0.5
you will like this person:  small doses

Notice that while building this k-nearest-neighbors system we never had a training step: there is nothing to train, because the algorithm simply computes distances against the stored examples at query time.
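That query-time computation is the whole algorithm. As a reference, here is a minimal sketch of a classify0-style routine matching the call signature used above (the body is a reconstruction, not necessarily the exact code in kNN.py):

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every stored training vector
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndices = distances.argsort()
    # majority vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteLabel = labels[sortedDistIndices[i]]
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]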

This also exposes a problem: the k-nearest-neighbors algorithm is slow, because every query must compute a distance to every training vector. How can this be optimized?
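One standard answer is to index the training vectors with a space-partitioning structure such as a k-d tree, so each query only examines a small fraction of the data. Here is a sketch using scipy.spatial.cKDTree (SciPy and the helper names build_index and classify_fast are my own assumptions, not part of kNN.py):

from collections import Counter
import numpy as np
from scipy.spatial import cKDTree

def build_index(normMat):
    # one-time O(n log n) build over the normalized training vectors
    return cKDTree(normMat)

def classify_fast(tree, labels, inX, k=3):
    # roughly O(log n) per query instead of n full distance computations
    _, idx = tree.query(inX, k=k)
    votes = Counter(labels[i] for i in np.atleast_1d(idx))
    return votes.most_common(1)[0][0]  # majority vote, as in classify0

For the thousand rows of the dating data this is overkill, but the same pattern keeps query time manageable as the training set grows.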