KMeans

KMeans provides a library to cluster data using the compute engine.

To use KMeans in greycat, you need to add the algebra library and use the kmeans module, as follows. The compute and util libraries will be used as well in this demo:

@library("algebra");
use kmeans;
use compute;
use util;

Input tensor shape for Kmeans

Since KMeans is an unsupervised learning algorithm, we only need an input tensor; no output tensor is needed. The input tensor should have 2 dimensions: [dataset_size, nb_features]

Let’s say you have a dataset of 1000 buildings to cluster, each having the following features: area, latitude, longitude, and number of rooms. Then your input tensor will be of shape [1000, 4], since we have 1000 observations, each with 4 features.
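
For illustration, here is a minimal sketch of how such an input tensor could be filled. The buildings collection and its fields (area, latitude, longitude, rooms) are hypothetical placeholders for your own data; only the Tensor calls come from the examples below:

    //Hypothetical sketch: filling an input tensor of shape [1000, 4]
    var input = Tensor {};
    input.init(TensorType::f64, [1000, 4]);
    for (var i = 0; i < 1000; i++) {
        //buildings is a hypothetical collection holding your 1000 building records
        input.set([i, 0], buildings[i].area);
        input.set([i, 1], buildings[i].latitude);
        input.set([i, 2], buildings[i].longitude);
        input.set([i, 3], buildings[i].rooms as float);
    }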

Generating a fake input tensor for the demo

As a first step, in order to generate an example training dataset, let’s add the following function:

fn generateDataset(nb_features: int, nb_clusters: int, nb_elem_per_cluster: int): Tensor {
    //Generating the tensor data at random
    var t = Tensor {};
    t.init(TensorType::f64, [0, nb_features]); //0 as first dimension since we will append

    var rand = Random::new();
    rand.setSeed(12345);
    for (var i = 0; i < nb_clusters; i++) {
        //For every cluster define the centroids feature location at random
        var centroid = Array<float>::new(nb_features);
        for (var j = 0; j < nb_features; j++) {
            centroid[j] = rand.uniformf(0.0, 10000.0); //generate a random feature between 0 and 10,000
        }

        //Generate the elements of the cluster and append them to the tensor
        var t_temp = Tensor {};
        t_temp.init(TensorType::f64, [nb_features]);

        for (var j = 0; j < nb_elem_per_cluster; j++) {
            for (var k = 0; k < nb_features; k++) {
                //generate the point around the centroid with some small variability of 0.1
                t_temp.set([k], centroid[k] + rand.uniformf(-0.1, 0.1));
            }
            t.append(t_temp);
        }
        info("Generated Centroid of cluster ${i} => ${centroid}");
    }
    info("Done generating the dataset, shape: ${t.shape()}");
    return t;
}

This function generates a tensor of the following shape: [nb_clusters x nb_elem_per_cluster, nb_features]

The demo dataset size will be nb_clusters x nb_elem_per_cluster observations, each having nb_features features. And we know in advance that there are exactly nb_clusters clusters in this fake dataset.

Run this as follows:

fn main() {
    //We will generate fake data containing N clusters
    var nb_clusters = 5;
    var nb_elem_per_cluster = 30;
    var nb_features = 6;
    var t = generateDataset(nb_features, nb_clusters, nb_elem_per_cluster); 
}

This should give the following output after greycat run:

"Generated Centroid of cluster 0 => [4461.2258972885,2147.2052541316,7467.4348987022,9282.2437264408,6786.679381871,123.3298425206]"
"Generated Centroid of cluster 1 => [2498.5974899021,1807.8230748921,7298.3473200809,8277.8369208229,1709.0473425151,7283.8390512782]"
"Generated Centroid of cluster 2 => [1181.1360489489,8407.685020197,9910.1870925679,6010.4521997322,5498.8028088113,2712.6128285716]"
"Generated Centroid of cluster 3 => [6614.0270077689,6146.2149751122,8803.8754969853,7789.5031067494,9329.8478840524,6244.017875867]"
"Generated Centroid of cluster 4 => [1349.4688092496,4743.1528264392,3914.820195136,3749.6192025717,4407.4531199445,7215.5907364635]"
"Done generating the dataset, shape: [150,6]"

As you can see, we have generated a dataset of 150 observations, each having 6 features. And we know in advance that these 150 observations belong to 5 clusters of 30 elements each. We even know in advance the centroids of each cluster, as displayed.

In reality, we wouldn’t have this information, and we would like KMeans to deduce it correctly.

KMeans simple learning

Let’s try to cluster the data in the previously generated tensor:

    var learning_seed = 4445555;
    var rounds = 10;

    //This runs kmeans with 10 learning rounds, with the dedicated seed
    //And tries to create 3 clusters on the previously generated dataset Tensor t
    var learning = Kmeans::learning(t, 3, learning_seed, rounds, 0.0, 1.0);
    info("loss: ${learning.loss}");

In this code, we’re asking KMeans to create exactly 3 clusters over the data.

The output will be:

loss: 552319.172748166

As you can see, the clustering did its job and put the data in 3 clusters; however, the learning.loss is a bit high (since we know in advance that the data should be in 5 clusters).

KMeans meta learning

Can we get a better clustering with 3 clusters?

Since KMeans depends on the initialization, it’s better to try different initializations and keep the one with the smallest loss; this is called meta learning.

In order to do so, add the following code:

    var meta_rounds = 50;
    //This creates 50 meta rounds, each starting with a different seed 
    //All trying exactly to put the data in 3 clusters and keeping the best seed at the end
    var meta_learning = Kmeans::meta_learning(t, 3, learning_seed, meta_rounds, rounds);
    info("loss of meta learning: ${meta_learning.bestResult?.loss}");

The result of meta learning is the following:

loss of meta learning: 434309.4872805935

As we can see, testing different seeds helped improve the clustering, and we get a lower loss.

Let’s try kmeans meta learning for 4 and 5 clusters.

    meta_learning = Kmeans::meta_learning(t, 4, learning_seed, meta_rounds, rounds);
    info("loss of meta learning for 4 clusters: ${meta_learning.bestResult?.loss}");

    meta_learning = Kmeans::meta_learning(t, 5, learning_seed, meta_rounds, rounds);
    info("loss of meta learning for 5 clusters: ${meta_learning.bestResult?.loss}");

The result will be:

"loss of meta learning for 4 clusters: 210387.2563267315"
"loss of meta learning for 5 clusters: 19.7191405789"

As we can see, the loss drops considerably as we go from 3 to 4 to 5 clusters.

KMeans meta meta learning

Most of the time, we don’t know in advance how many clusters we have in the data, so we want KMeans to try different numbers of clusters and stop once adding more clusters no longer reduces the loss significantly. Usually the elbow method is used for this.
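
To illustrate the idea, here is a rough elbow-style sketch that simply reruns the meta learning from above for increasing cluster counts and prints the loss; it reuses only the t, learning_seed, meta_rounds, and rounds variables already defined:

    //Rough sketch: manual elbow search using Kmeans::meta_learning
    for (var k = 2; k <= 8; k++) {
        var r = Kmeans::meta_learning(t, k, learning_seed, meta_rounds, rounds);
        info("clusters: ${k}, loss: ${r.bestResult?.loss}");
    }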

In the Kmeans library, we implemented meta_meta_learning, which automates this: it iterates from 2 up to a maximum number of clusters (provided by the user) and stops iterating if adding a new cluster group does not improve the loss by a given ratio. Let’s try this:

    //Meta Meta learning the best number of clusters, 
    //By running clustering from 2 up to a maximum of 20 clusters
    //And stopping adding new clusters if the improvement is not more than 5%
    var start = time::now();
    var meta_meta_learning = Kmeans::meta_meta_learning(t, 20, 0.05, learning_seed, meta_rounds, rounds);
    var end = time::now();
    info("Meta meta learning, Took: ${(end - start).tof(DurationUnit::seconds)} seconds for meta meta learning, loss: ${meta_meta_learning.bestResult?.loss}, best clustering found: ${meta_meta_learning.bestResult!!.centroids!!.shape()[0]} clusters");

Result:

"Meta meta learning, Took: 0.046755 seconds for meta meta learning, loss: 19.7191405789, best clustering found: 5 clusters"

Technically, the meta meta learning here first tried to fit the data with 2 clusters, running the learning several times, then tried 3 clusters and saw that the improvement was more than 5%; it continued adding clusters up to 6. At 6, it saw that the clustering loss stopped improving, which is why the final result was 5 clusters. Don’t forget that we initially created the fake data to have 5 clusters.

Displaying the KMeans result

Here is sample code that displays the KMeans centroids and cluster sizes:

    if (meta_meta_learning.bestResult?.centroids != null) {
        for (var i = 0; i < meta_meta_learning.bestResult!!.centroids!!.shape()[0]; i++) {
            var found_centroid = [];
            for (var j = 0; j < meta_meta_learning.bestResult!!.centroids!!.shape()[1]; j++) {
                found_centroid.add(meta_meta_learning.bestResult!!.centroids!!.get([i, j]));
            }
            info("Kmeans found centroid of cluster: ${i} =>${found_centroid}, cluster size: ${meta_meta_learning.bestResult!!.clusters_count!!.get([i])}");
        }
        //println("Assignment tensor: ${meta_meta_learning.bestResult?.assignment?.toString()}");
    }

This yields the following output:

"Kmeans found centroid of cluster: 0 =>[1349.4801783255,4743.1189285971,3914.8111425496,3749.6465072727,4407.4389406321,7215.5917719807], cluster size: 30"
"Kmeans found centroid of cluster: 1 =>[2498.5810454441,1807.8099005483,7298.3453121678,8277.8385463954,1709.0583863704,7283.8298470643], cluster size: 30"
"Kmeans found centroid of cluster: 2 =>[4461.2158240257,2147.2107723045,7467.4298211694,9282.2414078383,6786.6870032717,123.3179413488], cluster size: 30"
"Kmeans found centroid of cluster: 3 =>[1181.1399824462,8407.6973420244,9910.191830847,6010.4635405612,5498.7854923874,2712.6185790087], cluster size: 30"
"Kmeans found centroid of cluster: 4 =>[6614.0384305896,6146.2165336961,8803.8599381524,7789.5045214521,9329.8531760851,6244.0173284767], cluster size: 30"

Note that KMeans managed to find the same centroids as generated in the fake data; however, the clusters are in a different order.

KMeans inference

Before going to the inference code, let’s add these two utility functions:

fn generateMiniBatch(maxBatchSize: int, nb_features: int): Tensor {
    var mb = Tensor {};
    mb.init(TensorType::f64, [maxBatchSize, nb_features]);
    var rand = Random::new();
    rand.setSeed(4466);

    for (var i = 0; i < maxBatchSize; i++) {
        for (var j = 0; j < nb_features; j++) {
            mb.set([i, j], rand.uniformf(0.0, 10000.0)); //generate a random feature between 0 and 10,000
        }
    }
    return mb;
}

fn displayResult(mb: Tensor, clustering: Tensor, distances: Tensor) {
    var mb_size = mb.shape()[0];
    var features = mb.shape()[1];
    var nb_clusters = distances.shape()[1];
    for (var i = 0; i < mb_size; i++) {
        print("[");
        for (var j = 0; j < features; j++) {
            print(mb.get([i, j]));
            if (j != features - 1) {
                print(",");
            }
        }
        print("] => Clustered as class ${clustering.get([i])}, distances: [");
        for (var j = 0; j < nb_clusters; j++) {
            print("${j}: ${distances.get([i, j])}");
            if (j != nb_clusters - 1) {
                print(", ");
            }
        }
        println("]");
    }
}

generateMiniBatch generates a random minibatch to cluster; displayResult reformats the results to make the KMeans output easier to read.

In order to use KMeans for inference, we need to create a compute engine, initialize it for KMeans inference, and copy the centroid information. The Kmeans library offers this through the following code:

    //Inference mode
    var maxBatchSize = 2;
    var engine = Kmeans::getInferenceEngine(meta_meta_learning.bestResult!!, maxBatchSize);
    var minibatch = generateMiniBatch(maxBatchSize, nb_features); 

    var clusters = Kmeans::cluster(engine, minibatch);
    var distance = Kmeans::getDistances(engine);
    displayResult(minibatch, clusters, distance);

Running this, you get:

[5616.5656706395,4350.045199669,2695.655032385,6955.9812624734,5251.5195101739,3813.2451213027] => Clustered as class 0, distances: [0: 6512.9013870708, 1: 7981.9071707246, 2: 7095.1346362469, 3: 9505.4513924496, 4: 8047.8301781161]
[7892.8689695396,7964.7184712648,5881.9533772217,4218.8531552529,6400.3502607347,4148.2945923453] => Clustered as class 4, distances: [0: 8406.2150492499, 1: 10831.767845516, 2: 9493.2064345516, 3: 8219.1297431623, 4: 6261.2688084626]

We have generated a minibatch of 2 observations; we see the generated features, then the clustering result. The first observation was clustered as class 0, the second as class 4. The display function also prints the distance between the observation and the centroids of all clusters. You can verify that the distance to cluster 0 is the smallest for the first observation, and the distance to cluster 4 is the smallest for the second.
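
If you want to check this programmatically rather than by eye, here is a small sketch that scans the distance tensor returned above and compares the nearest centroid with the assigned cluster; it only reuses the maxBatchSize, clusters, and distance variables from the inference code:

    //Sketch: verify that each observation is assigned to its nearest centroid
    for (var i = 0; i < maxBatchSize; i++) {
        var best = 0;
        for (var j = 1; j < distance.shape()[1]; j++) {
            if ((distance.get([i, j]) as float) < (distance.get([i, best]) as float)) {
                best = j;
            }
        }
        info("observation ${i}: assigned to ${clusters.get([i])}, nearest centroid is ${best}");
    }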

Scalable Kmeans

In all the previous examples, we put the whole dataset in one tensor and ran KMeans on it. In reality, the dataset might be too big for that, and we need to minibatch the learning itself as well.

This is how to do it:

fn scalable_kmeans(): KmeanResult {
    var features: int = 10;
    var tensorType = TensorType::f64;
    var nbClusters = 5;
    var featuresMin = 0.0;
    var featuresMax = 1.0;
    var seed = 1234;
    var rounds = 20; 
    var batchSize = 100;
    var nb_batches = 1000;

    var result: KmeanResult = KmeanResult {
        loss: float::max,
        roundsDistances: [],
    };

    var engine: ComputeEngine = ComputeEngine::new();
    var model: ComputeModel = Kmeans::configure(nbClusters, features, tensorType, featuresMin, featuresMax);
    var memorySize: int = engine.compile(model, batchSize);
    //info("Compute will take: ${memorySize} bytes for batch size of ${batchSize}");

    Kmeans::initialize(engine, seed);
    var sumOfDistances = 0.0;

    var minibatch = Tensor {};
    minibatch.init(tensorType, [batchSize, features]);

    for (var round = 0; round < rounds ?? Kmeans::default_rounds; round++) {
        if (round > 1) {
            Kmeans::replaceEmptyClusters(engine);
            Kmeans::sortClusters(engine);
        }
        Kmeans::init_round(engine);

        for (var i = 0; i < nb_batches; i++) {
            //todo fill the minibatch tensor here
            Kmeans::learn(engine, minibatch);
        }

        Kmeans::end_round(engine);
        sumOfDistances = Kmeans::getSumOfDistances(engine).get([0]) as float;
        result.roundsDistances.add(sumOfDistances);
    }
    Kmeans::calculate_stats(engine);

    result.loss = sumOfDistances;
    result.centroids = Kmeans::getClustersCentroids(engine);
    result.clusters_count = Kmeans::getClustersCounts(engine);
    result.clusters_avg_distance = Kmeans::getClustersAvgOfDistances(engine);
    result.clusters_sum_distance = Kmeans::getClustersSumOfDistances(engine);
    result.assignement = Kmeans::getAssignement(engine);
    result.distances = Kmeans::getBestDistances(engine);
    result.clusterInterDistances = Kmeans::getClustersDistancesToEachOther(engine);

    return result;
}
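
For completeness, here is what the //todo inside the batch loop could look like for a purely synthetic run, filling the minibatch with random values in the [featuresMin, featuresMax] range. This reuses the Random pattern of generateMiniBatch and assumes a var rand = Random::new(); created once before the rounds loop; in a real application you would instead copy the next batchSize rows of your dataset into the minibatch tensor:

    //Sketch: replace the //todo by filling the minibatch before Kmeans::learn
    for (var row = 0; row < batchSize; row++) {
        for (var col = 0; col < features; col++) {
            minibatch.set([row, col], rand.uniformf(featuresMin, featuresMax));
        }
    }
    Kmeans::learn(engine, minibatch);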