Machine Learning with Spark (KMeans examples) and Scala

This article is just my first step in Spark ML.
You can read a short description on Wikipedia: K-means_clustering.

First of all, we need to prepare everything.

1) Test data. Our data is a set of 2D points; each point has 3 properties: an X coordinate, a Y coordinate, and a color. I am going to run the clustering twice, without and with colors.

I use Oracle SQL (because it is close at hand) to generate the test data: pick 3 cluster centers and generate random points around each of them.


drop table delit_test_data;

create table delit_test_data as
--cluster 1
select 1 as cluster_number,
       ROUND(2+(case when dbms_random.value<0.5 then -1 else +1 end)*dbms_random.value*1.5,2) as x,
       ROUND(2+(case when dbms_random.value<0.5 then -1 else +1 end)*dbms_random.value*1.5,2) as y
  from dual
  connect by rownum<=30
union all
--cluster 2
select 2 as cluster_number,
       ROUND(8+(case when dbms_random.value<0.5 then -1 else +1 end)*dbms_random.value*2,2) as x,
       ROUND(2+(case when dbms_random.value<0.5 then -1 else +1 end)*dbms_random.value*2,2) as y
  from dual
  connect by rownum<=40
union all
--cluster 3
select 3 as cluster_number,
       ROUND(6+(case when dbms_random.value<0.5 then -1 else +1 end)*dbms_random.value*2,2) as x,
       ROUND(6+(case when dbms_random.value<0.5 then -1 else +1 end)*dbms_random.value*2,2) as y
  from dual
  connect by rownum<=50;

select x,y from delit_test_data;


The cluster_number field does not matter here; it will be needed later for the colors. Now we can put the point data into Excel and draw a simple plot.


As you can see on the plot and in the query, we have 3 cluster centers with coordinates:

(2,2) (8,2) (6,6)

Next, we need to prepare a data source for Spark, specifically in libsvm format. Each point gets an index (starting from 0):


0 1:1.46 2:2.72 3:1
1 1:1.45 2:2.71 3:1
2 1:3.1 2:0.54 3:1
3 1:2.36 2:2.13 3:1
4 1:1.26 2:1.83 3:1
5 1:0.74 2:3.25 3:1
6 1:2.33 2:1.87 3:1
7 1:1.59 2:1.71 3:1


The general form of each line: 1: carries the value of property #1 (the X coordinate), 2: the value of property #2 (the Y coordinate), and 3: the color of the point. For now every point has the same color = 1.

SQL query for libsvm format.
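A sketch of such a query, following the same pattern as the colored version later in the article (assuming the delit_test_data table from above, with a constant color of 1 for now):

```sql
-- One libsvm line per point: <index> 1:<x> 2:<y> 3:<color>.
-- rownum-1 produces a zero-based index; replace(':.',':0.') restores
-- the leading zero that Oracle drops from numbers like 0.54.
select (rownum-1)||' '||replace('1:'||x||' '||'2:'||y||' '||'3:1',
                                ':.',':0.') as point_properties
  from delit_test_data;
```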

And the Scala code for Spark:


[root@smn ~]# spark-shell --driver-memory 1G --executor-memory 1G --driver-cores 1 --executor-cores 1

import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val dataset = spark.read.format("libsvm").load("/root/sample_kmeans_data.txt")

// Trains a k-means model.
val kmeans = new KMeans().setK(3).setSeed(1L)
val model = kmeans.fit(dataset)

// Make predictions
val predictions = model.transform(dataset)

// Evaluate clustering by computing Silhouette score
val evaluator = new ClusteringEvaluator()

val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)


And the output:



2019-02-06 11:57:38 WARN  LibSVMFileFormat:66 - 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan.
2019-02-06 11:57:53 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
dataset: org.apache.spark.sql.DataFrame = [label: double, features: vector]

silhouette: Double = 0.7536538721887516

Cluster Centers:

[6.160208333333333,5.843333333333334,1.0]
[1.95,2.1493333333333333,1.0]
[7.980476190476189,2.0604761904761904,1.0]


As you can see, the cluster centers closely approximate our exact centers (2,2), (8,2), (6,6).
Now I put them into Excel together with the exact centers and make one more plot.


Here we can see that the KMeans result is quite good.
But what about the colors of the points? Next, I will set color #1 for the (2,2) group and color #2 for the other two groups, and then search for 2 cluster centers.

SQL Query


select (rownum-1)||' '||replace('1:'||x||' '||'2:'||y||' '||'3:'||
        decode(CLUSTER_NUMBER,1,1,2),':.',':0.') as point_properties
from delit_test_data


And the Spark output (the only change is val kmeans = new KMeans().setK(2).setSeed(1L)):


Cluster Centers:

[7.009666666666666,4.077999999999998,2.0]
[1.95,2.1493333333333333,1.0]


One more plot with the new centers and the colors.


It also looks good.

Of course, we can use more properties for each point and go into the N-dimensional world.
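For example, to move to 3D, the generator could produce a z coordinate as well and emit it as an extra property. A sketch (the delit_test_data_3d table and its z column are hypothetical, following the same pattern as x and y above):

```sql
-- Hypothetical 3D variant: four properties per libsvm line,
-- 1:x 2:y 3:z plus 4:color, with the same leading-zero fix.
select (rownum-1)||' '||replace(
         '1:'||x||' '||'2:'||y||' '||'3:'||z||' '||'4:1',
         ':.',':0.') as point_properties
  from delit_test_data_3d;
```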
