Creation of KPIs

GreyCat is at its best when key indicators are computed and stored alongside the core data. This chapter walks you through computing statistical indicators and adding them to the model.

The goal

Building on the JCDecaux Bikes case started in this section, we will add an attribute that holds statistics on bike availability.

The mean - GaussianProfile

We are going to add an attribute to our Stations that provides statistics on the availability of bikes for each hour of the day, for each of the seven days of the week. This will let us plot, for each day of the week, the average number of bikes available per hour.

The GaussianProfile can be seen as an array where each cell is a distinct Gaussian distribution. In our example, we will need an array of 24 hours x 7 days, i.e. 168 slots. We will therefore have to:

Add an attribute in our model
Initialize the attribute for each station
Update the Gaussian profile, in the appropriate slot, each time we receive a new record
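
To make the "array of Gaussian distributions" idea concrete, here is a minimal conceptual sketch in plain Python. It is not GreyCat code and makes no claim about how GaussianProfile is implemented internally: the class name SlotProfile and its std method are hypothetical, and only the add and avg names mirror the calls used later in this chapter. Each slot simply accumulates a count, a sum and a sum of squares.

```python
import math


class SlotProfile:
    """Toy stand-in for a per-slot Gaussian profile (hypothetical, not the GreyCat type)."""

    def __init__(self, nb_slots: int) -> None:
        # One accumulator triple per slot: count, sum, sum of squares.
        self.count = [0] * nb_slots
        self.total = [0.0] * nb_slots
        self.total_sq = [0.0] * nb_slots

    def add(self, slot: int, value: float) -> None:
        """Record one observation in the given slot."""
        self.count[slot] += 1
        self.total[slot] += value
        self.total_sq[slot] += value * value

    def avg(self, slot: int) -> float:
        """Mean of the slot, or NaN if the slot has no observation yet."""
        n = self.count[slot]
        return self.total[slot] / n if n else math.nan

    def std(self, slot: int) -> float:
        """Population standard deviation of the slot, or NaN if empty."""
        n = self.count[slot]
        if n == 0:
            return math.nan
        mean = self.total[slot] / n
        # Clamp at zero to absorb tiny negative values from rounding.
        variance = max(self.total_sq[slot] / n - mean * mean, 0.0)
        return math.sqrt(variance)


# 24 hours x 7 days = 168 slots, as in the chapter.
profile = SlotProfile(24 * 7)
profile.add(107, 4.0)
profile.add(107, 8.0)
print(profile.avg(107), profile.std(107))
```

The point of this layout is that each new record only touches one slot and costs a constant amount of work, so the profile can be kept up to date while the data is being imported.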

The doing

First we need to update our model. At the top of the file model/station.gcl, we add a use util; instruction, because the util module of GreyCat contains the GaussianProfile type. We then add an attribute available_bikes_profile of type node<GaussianProfile>. As you can see, the Gaussian profile is stored in a separate node, so that stations can be loaded without loading the profile every time; the profile is only loaded on demand. This also allows referencing the profile from elsewhere if needed later. The Station type should now start like this.

use util;

type Station {
    [...]
    available_bikes: nodeTime<int>;
    available_bikes_profile: node<GaussianProfile>;
    available_stands: nodeTime<int>;
[...]

We need to initialize this attribute when we create the stations, so we add the following to the Station creation code. Here we create a GaussianProfile with 24 hours x 7 days = 168 slots and wrap it in a new node. N.B. you also need to add the instruction use util; at the top of your project.gcl file.

stationNode = node<Station>::new(Station{
    [...]
    available_bikes: nodeTime<int>::new(),
    available_bikes_profile: node<GaussianProfile>::new(GaussianProfile::new(24 * 7)),
    available_stands: nodeTime<int>::new(),
    [...]

Last but not least, we update the slots for each record we parse. To do this, we modify the loop where we fill the time series. We start by creating a Date from the last_update time. Note that we build the date in the Brussels time zone, so the profile is expressed in local time. From this date we get the day of the week and the hour of the record, which give the slot to update: 24 x day of week + hour. Here it goes.

for( _, record in stationRecords) {
    var lastUpdate = time::new(record.get("last_update") as int, DurationUnit::milliseconds);
    var recordDate = Date::fromTime(lastUpdate, TimeZone::Europe_Brussels);
    var slotId = 24 * recordDate.dayOfWeek() + recordDate.hours();
    var nbBikesAvailable = record.get("available_bikes") as int;
    station.available_bikes_profile->add(slotId, nbBikesAvailable as float);
    [...]
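
The slot arithmetic above can be checked independently of GreyCat. The following plain Python sketch is an illustration only: the function name slot_id is hypothetical, and it assumes the Sunday=0 to Saturday=6 day numbering used later in this chapter. To stay self-contained it uses a fixed UTC+1 offset (Brussels winter time); a real time zone database entry such as zoneinfo.ZoneInfo("Europe/Brussels") would also handle daylight saving time.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def slot_id(epoch_ms: int, tz: tzinfo) -> int:
    """Map a millisecond timestamp to one of the 168 weekly slots (hypothetical helper)."""
    local = datetime.fromtimestamp(epoch_ms / 1000, tz=tz)
    # Python numbers days Monday=0..Sunday=6; convert to Sunday=0..Saturday=6.
    day_of_week = (local.weekday() + 1) % 7
    return 24 * day_of_week + local.hour


# Fixed offset standing in for Brussels winter time (assumption, no DST handling).
cet = timezone(timedelta(hours=1))

# 2024-01-04T10:30:00Z is a Thursday, 11:30 local time: slot 24 * 4 + 11.
print(slot_id(1704364200000, cet))
```

Because the conversion happens before the day and hour are extracted, a record taken shortly before midnight UTC lands in the slot of the next local day, which is why the time zone matters for the profile.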

All set. Now the question is: how do we fill the GaussianProfile for the data we have already imported?
Two options are available:

You can create another function, alongside main, whose only task is to upgrade the existing Station nodes and populate their GaussianProfiles. For the sake of time, and because we have a very small dataset, we will not go for this one.
Delete the data you already have (by deleting the gcdata folder at the root of your project) and run the main function again. This solution is handy at first to iterate fast, but it only suits small datasets, because you need to import everything again. GreyCat has been built to perform these imports quickly, but beyond a certain size re-importing becomes tedious.

The use

We now have a GaussianProfile that is up to date with the last entry of our dataset. Let's see what we can do with it.

Availability of bikes on Thursdays

Let's see how to extract the average number of bikes available on Thursdays, hour by hour.
We will present the data in a Table with one row per station of Brussels and one column per hour.
We add this function next to the main function, at the root of our project.gcl file.

fn bikesPerHour() {
    var day = 4; //Sunday=0, Saturday=6
    var baseSlot = day * 24;
    var endSlot = (day+1) * 24;
    
    var result = Table::new(25); //One per hour plus station name
    var tableLine = 0;

    //For each station
    for(stationName, stationNode in stations_by_name) {
        //Set the name of the station in the first column of the current line
        result.set(tableLine, 0, stationName);

        //Get available bikes profile and resolve it (to not resolve each time)
        var availableBikesProfile = *stationNode->available_bikes_profile;
        //Fill the remaining columns with the average number of bikes, reduced to an integer
        var col = 1;
        var currentSlot = baseSlot;
        while(currentSlot < endSlot) {
            result.set(tableLine, col, availableBikesProfile.avg(currentSlot) as int);
            currentSlot++;
            col++;
        }
        tableLine++;
    }
    //Display result in console
    println(result);
}
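
The row-building logic of this function can be sketched outside GreyCat as follows. This is plain Python for illustration, not GreyCat code: the helper day_row and the sample data are hypothetical, and the sketch assumes every slot already holds at least one observation, since truncating a missing average would fail.

```python
def day_row(station_name: str, slot_averages: list[float], day: int) -> list:
    """Build one table row: station name followed by 24 truncated hourly averages."""
    base_slot = day * 24        # first slot of the requested day
    end_slot = (day + 1) * 24   # first slot of the following day (exclusive)
    hourly = [int(avg) for avg in slot_averages[base_slot:end_slot]]
    return [station_name, *hourly]


# Made-up averages, one per weekly slot, only to exercise the slicing.
averages = [slot / 10 for slot in range(24 * 7)]

# day=4 is Thursday with the Sunday=0 numbering used above.
row = day_row("Example station", averages, day=4)
print(len(row), row[:3])
```

Note that int() truncates toward zero rather than rounding, which matches the "reduced to an integer" behavior described for the table.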

Then, when you run greycat run bikesPerHour, you get the Table displayed as a JSON object. The attribute meta gives information on each column; the field data contains the 2-dimensional array of data we created. Each line presents the name of the station, followed by the truncated average number of bikes available per hour of the day, starting at midnight.
This table could then be used by third-party systems or front-end presentations. Exposing this function as an API for third parties is described in the next section.