日撸 Java 三百行学习笔记day51-53

kNN 分类器

KNN分类器是基于记忆的机器学习模型的一个例子。

这意味着这个模型会记住训练示例，然后他们用它来分类以前从未见过的对象。

KNN的这样的基本思想有点类似于生活中的“物以类聚。人以群分”。

在KNN学习中，首先计算待分类数据特征与训练数据特征之间的距离并排序。取出距离近期的K个训练数据特征（KNN分类器的k是为了预测一个新的测试实例而检索的训练样本数）。然后根据这K个相近训练数据特征所属类别来判定新样本类别：假设它们都属于一类，那么新的样本也属于这个类；否则，对每一个候选类别进行评分，依照某种规则确定新的样本的类别。

内容不少，我们分开解析：

前面的就是各种成员变量，包括两种距离，以前也是没有了解过，这里先说明一下

Manhattan distance和Euclidean distance：

	public double distance(int paraI, int paraJ) {
		double resultDistance = 0;
		double tempDifference;
		switch (distanceMeasure) {
		case MANHATTAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				if (tempDifference < 0) {
					resultDistance -= tempDifference;
				} else {
					resultDistance += tempDifference;
				} // Of if
			} // Of for i
			break;

		case EUCLIDEAN:
			for (int i = 0; i < dataset.numAttributes() - 1; i++) {
				tempDifference = dataset.instance(paraI).value(i) - dataset.instance(paraJ).value(i);
				resultDistance += tempDifference * tempDifference;
			} // Of for i
			break;
		default:
			System.out.println("Unsupported distance measure: " + distanceMeasure);
		}// Of switch

		return resultDistance;
	}// Of distance

Manhattan distance就是距离的绝对值累加；

Euclidean distance距离的平方累加，比较正规的情况下是需要开根号的，但是这里的不开平方一样的，平方大开根号后大小对比起来还是不变，比的是距离嘛。

为了打乱顺序，还使用到这样一种方式，通过多次对换位置来打乱数组里内容（存放的下标）的顺序：

	public static int[] getRandomIndices(int paraLength) {
		int[] resultIndices = new int[paraLength];

		// Step 1. Initialize.
		for (int i = 0; i < paraLength; i++) {
			resultIndices[i] = i;
		} // Of for i

		// Step 2. Randomly swap.
		int tempFirst, tempSecond, tempValue;
		for (int i = 0; i < paraLength; i++) {
			// Generate two random indices.
			tempFirst = random.nextInt(paraLength);
			tempSecond = random.nextInt(paraLength);

			// Swap.
			tempValue = resultIndices[tempFirst];
			resultIndices[tempFirst] = resultIndices[tempSecond];
			resultIndices[tempSecond] = tempValue;
		} // Of for i

		return resultIndices;
	}// Of getRandomIndices

比较核心的就是间址的灵活使用: trainingSet 和 testingSet 都是整数数组, 表示下标.

然后再划分训练集和测试集，两个集合互补：

	public void splitTrainingTesting(double paraTrainingFraction) {
		int tempSize = dataset.numInstances();
		int[] tempIndices = getRandomIndices(tempSize);
		int tempTrainingSize = (int) (tempSize * paraTrainingFraction);

		trainingSet = new int[tempTrainingSize];
		testingSet = new int[tempSize - tempTrainingSize];

		for (int i = 0; i < tempTrainingSize; i++) {
			trainingSet[i] = tempIndices[i];
		} // Of for i

		for (int i = 0; i < tempSize - tempTrainingSize; i++) {
			testingSet[i] = tempIndices[tempTrainingSize + i];
		} // Of for i
	}// Of splitTrainingTesting

接下来就是最重要的预测部分，无参和有参两种形式，里面又包含了计算邻居和投票：

	public void predict() {
		predictions = new int[testingSet.length];
		for (int i = 0; i < predictions.length; i++) {
			predictions[i] = predict(testingSet[i]);
		} // Of for i
	}// Of predict

	
	public int predict(int paraIndex) {
		int[] tempNeighbors = computeNearests(paraIndex);
		int resultPrediction = simpleVoting(tempNeighbors);

		return resultPrediction;
	}// Of predict


public int[] computeNearests(int paraCurrent) {
		int[] resultNearests = new int[numNeighbors];
		boolean[] tempSelected = new boolean[trainingSet.length];
		double tempMinimalDistance;
		int tempMinimalIndex = 0;

		// Compute all distances to avoid redundant computation.
		double[] tempDistances = new double[trainingSet.length];
		for (int i = 0; i < trainingSet.length; i ++) {
			tempDistances[i] = distance(paraCurrent, trainingSet[i]);
		}//Of for i
		
		// Select the nearest paraK indices.
		for (int i = 0; i < numNeighbors; i++) {
			tempMinimalDistance = Double.MAX_VALUE;

			for (int j = 0; j < trainingSet.length; j++) {
				if (tempSelected[j]) {
					continue;
				} // Of if

				if (tempDistances[j] < tempMinimalDistance) {
					tempMinimalDistance = tempDistances[j];
					tempMinimalIndex = j;
				} // Of if
			} // Of for j

			resultNearests[i] = trainingSet[tempMinimalIndex];
			tempSelected[tempMinimalIndex] = true;
		} // Of for i

		System.out.println("The nearest of " + paraCurrent + " are: " + Arrays.toString(resultNearests));
		return resultNearests;
	}// Of computeNearests

	
	public int simpleVoting(int[] paraNeighbors) {
		int[] tempVotes = new int[dataset.numClasses()];
		for (int i = 0; i < paraNeighbors.length; i++) {
			tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
		} // Of for i

		int tempMaximalVotingIndex = 0;
		int tempMaximalVoting = 0;
		for (int i = 0; i < dataset.numClasses(); i++) {
			if (tempVotes[i] > tempMaximalVoting) {
				tempMaximalVoting = tempVotes[i];
				tempMaximalVotingIndex = i;
			} // Of if
		} // Of for i

		return tempMaximalVotingIndex;
	}// Of simpleVoting

其中核心当然是找邻居和投票了，邻居的个数是自己设定的，双重for循环，外层是要找的邻居个数，内层则是代表要在训练集里面找而且要最近的那个，依次排列到数组resultNearests里面。投票就是在所选中的邻居里选择“支持率”最高的，也就是这一点：

for (int i = 0; i < paraNeighbors.length; i++) {
			tempVotes[(int) dataset.instance(paraNeighbors[i]).classValue()]++;
		} // Of for i

最后当然就是看看你的效果如何了，看看你的准确度了：

	public double getAccuracy() {
		// A double divides an int gets another double.
		double tempCorrect = 0;
		for (int i = 0; i < predictions.length; i++) {
			if (predictions[i] == dataset.instance(testingSet[i]).classValue()) {
				tempCorrect++;
			} // Of if
		} // Of for i

		return tempCorrect / testingSet.length;
	}// Of getAccuracy

代码有点小长，但是呢理清逻辑就显得明白许多，各种方法的调用，成员变量，还有间址的使用，当时听了课还是有很多没有理顺，自己下来静心看了边代码，再看了录屏才觉得豁然开朗了某些部分。因为之前对Knn确实没有什么了解，所以显得不那么顺手，接下来会多多了解参悟，机器学习感觉是对未知的探寻，很有意思。

日撸 Java 三百行学习笔记day51-53

Java相关栏目本月热门文章