Creation of KPIs
GreyCat is most useful when key indicators are computed and stored alongside the core data. This chapter walks you through computing statistical indicators and adding them to the model.
The goal
Building on the JCDecaux Bikes case started in this section, we will add an attribute that holds statistics on bike availability.
The mean - GaussianProfile
We are going to add an attribute to our Stations that provides statistics on the availability of bikes for each hour of the day, for each of the seven days of the week. This will let us plot, for each day of the week, the average number of bikes available per hour.
The GaussianProfile can be seen as an array where each cell is a distinct Gaussian distribution. In our example, we will need an array of 24 hours x 7 days, i.e. 168 slots. We will therefore have to:
- Add an attribute in our model
- Initialize the attribute for each station
- Update the Gaussian profile, in the appropriate slot, each time we receive a new record
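To make the "array of Gaussian distributions" idea concrete, here is a minimal conceptual sketch in Python. It is not GreyCat's implementation: the class name `SlotProfile` and its internals are assumptions made for illustration, and only the `add` and `avg` names mirror the calls used later in this chapter. Each slot keeps a count, a sum and a sum of squares, which is enough to recover a mean and a standard deviation per slot.

```python
import math
from typing import Optional


class SlotProfile:
    """Hypothetical stand-in for a Gaussian profile: one running Gaussian per slot."""

    def __init__(self, nb_slots: int) -> None:
        self.count = [0] * nb_slots
        self.total = [0.0] * nb_slots
        self.total_sq = [0.0] * nb_slots

    def add(self, slot: int, value: float) -> None:
        # Only the chosen slot is updated; all other slots are left untouched.
        self.count[slot] += 1
        self.total[slot] += value
        self.total_sq[slot] += value * value

    def avg(self, slot: int) -> Optional[float]:
        # A slot that has never received a value has no average.
        if self.count[slot] == 0:
            return None
        return self.total[slot] / self.count[slot]

    def std(self, slot: int) -> Optional[float]:
        # Population standard deviation, computed from the running sums.
        if self.count[slot] == 0:
            return None
        mean = self.total[slot] / self.count[slot]
        variance = self.total_sq[slot] / self.count[slot] - mean * mean
        return math.sqrt(max(variance, 0.0))


# 24 hours x 7 days = 168 slots
profile = SlotProfile(24 * 7)
# Two made-up observations falling in the same slot (day 4, hour 8)
profile.add(4 * 24 + 8, 10.0)
profile.add(4 * 24 + 8, 14.0)
print(profile.avg(4 * 24 + 8))  # 12.0
print(profile.std(4 * 24 + 8))  # 2.0
```

The point of the design is that a slot never stores individual records, only a few running sums, so the profile stays the same size however many records are added.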
The doing
First, we need to update our model. At the top of the file model/station.gcl, we add an instruction declaring that we use GreyCat's util module, which contains the GaussianProfile type. We then add an attribute available_bikes_profile of type node<GaussianProfile>.
use util;
type Station {
    [...]
    available_bikes: nodeTime<int>;
    available_bikes_profile: node<GaussianProfile>;
    available_stands: nodeTime<int>;
    [...]
We need to initialize this attribute when we create the stations, so we add the following to the creation of Stations.
Here we create a GaussianProfile with 24 hours x 7 days slots and wrap it in a new node. N.B. you also need to add the instruction use util; to the header of your project.gcl file.
stationNode = node<Station>::new(Station{
    [...]
    available_bikes: nodeTime<int>::new(),
    available_bikes_profile: node<GaussianProfile>::new(GaussianProfile::new(24 * 7)),
    available_stands: nodeTime<int>::new(),
    [...]
Last but not least, we update the slots for each record we parse. To do this, we modify the loop where we fill the time series. We start by creating a Date from the last_update time. Note that we build the date in the Brussels time zone, so the profile is expressed in local time. The date gives us the day of the week and the hour of the record, from which we compute the slot to update. Here it goes.
for( _, record in stationRecords) {
    var lastUpdate = time::new(record.get("last_update") as int, DurationUnit::milliseconds);
    var recordDate = Date::fromTime(lastUpdate, TimeZone::Europe_Brussels);
    var slotId = 24 * recordDate.dayOfWeek() + recordDate.hours();
    var nbBikesAvailable = record.get("available_bikes") as int;
    station.available_bikes_profile->add(slotId, nbBikesAvailable as float);
    [...]
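The slot arithmetic can be checked outside GreyCat. The following Python sketch reproduces the same computation under stated assumptions: the function name `slot_id` is hypothetical, the day numbering follows the Sunday=0 to Saturday=6 convention used in this chapter, and a fixed UTC+1 offset stands in for Brussels winter time to keep the example free of external time zone data.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def slot_id(epoch_ms: int, tz: tzinfo) -> int:
    """Map a millisecond epoch timestamp to a slot of a 24 x 7 weekly profile."""
    local = datetime.fromtimestamp(epoch_ms / 1000, tz)
    # Python's weekday() uses Monday=0; shift so that Sunday=0 and Saturday=6.
    day_of_week = (local.weekday() + 1) % 7
    return 24 * day_of_week + local.hour


# Thursday 4 January 2024, 07:30 UTC, i.e. 08:30 at UTC+1
timestamp_ms = 1704353400000
utc_plus_one = timezone(timedelta(hours=1))  # assumption: stand-in for Brussels in winter

print(slot_id(timestamp_ms, utc_plus_one))  # 104 (day 4, hour 8)
print(slot_id(timestamp_ms, timezone.utc))  # 103 (day 4, hour 7)
```

The two results differ by one slot, which is why the time zone matters when building the profile. In Python, passing `zoneinfo.ZoneInfo("Europe/Brussels")` instead of the fixed offset would also account for daylight saving time.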
All set. Now the question is: how do we fill the GaussianProfile?
Two options are available:
- You can create another function, next to the main function, whose only task is to upgrade the existing Station nodes and populate their GaussianProfiles. For the sake of time, and because we have a very small dataset, we will not use this option.
- Delete the data you already have (by deleting the gcdata folder at the root of your project) and run the main function again. This solution is handy at first for iterating quickly, but it only suits small datasets, because everything has to be imported again. GreyCat has been built to perform these tasks quite quickly, but beyond a certain dataset size it becomes very tedious.
The use
We now have a GaussianProfile that is up to date with the last entry of our dataset. Let's see what we can do with it.
Availability of bikes on Thursdays
Let's see how to extract the average number of bikes available on Thursdays, hour by hour.
We will present the data in a Table whose rows are the stations of Brussels and whose columns are the hours of the day.
We add this function next to the main function, at the root of our project.gcl file.
fn bikesPerHour() {
    var day = 4; //Sunday=0, Saturday=6
    var baseSlot = day * 24;
    var endSlot = (day+1) * 24;
    var result = Table::new(25); //One per hour plus station name
    var tableLine = 0;
    //For each station
    for(stationName, stationNode in stations_by_name) {
        //Set the name of the station in the first column of the current line
        result.set(tableLine, 0, stationName);
        //Get the available bikes profile and resolve it once (to avoid resolving it at each iteration)
        var availableBikesProfile = *stationNode->available_bikes_profile;
        //Fill the remaining columns with the average number of bikes, truncated to an integer
        var col = 1;
        var currentSlot = baseSlot;
        while(currentSlot < endSlot) {
            result.set(tableLine, col, availableBikesProfile.avg(currentSlot) as int);
            currentSlot++;
            col++;
        }
        tableLine++;
    }
    //Display result in console
    println(result);
}
Then, when you run greycat run bikesPerHour, the Table is displayed as a JSON object. The meta attribute gives information on each column; the data field contains the 2-dimensional array of data we created. Each row presents the name of the station, followed by the truncated average number of bikes available for each hour of the day, starting at midnight.
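The row layout described above (a station name followed by 24 truncated hourly averages) can be mimicked in plain Python to check the slot range for a given day. This is only a sketch: `day_row` is a hypothetical helper, the station name is a placeholder, and the averages are made-up values chosen so that slot i holds i + 0.9.

```python
def day_row(station_name, slot_averages, day):
    """Build one table row: station name, then 24 truncated hourly averages."""
    base_slot = day * 24
    end_slot = (day + 1) * 24
    # int() truncates toward zero, as in the truncated averages of the table
    return [station_name] + [int(avg) for avg in slot_averages[base_slot:end_slot]]


# Made-up weekly profile of 168 averages: slot i holds i + 0.9
fake_averages = [i + 0.9 for i in range(24 * 7)]

row = day_row("Example station", fake_averages, 4)  # day 4 = Thursday (Sunday=0)
print(len(row))  # 25 columns: name + 24 hours
print(row[1])    # 96, the truncated average for midnight on day 4
print(row[24])   # 119, the truncated average for 23:00 on day 4
```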
This table could then be exploited in third-party systems or front-end presentations.
Publishing this function as an API, to make it available to third parties, is described in the next section.