Data Structures
GreyCat offers a selection of data structures to support organizing and computing on large data.
Tuple
A simple association structure that holds a pair of values. It can be specialized with the generic types T and U for the left-hand and right-hand values respectively.
var tupleA = Tuple{x:0.5,y:"a"}; // Tuple{x:0.5,y:"a"}
var tupleB = (0.5,"b"); // Tuple{x:0.5,y:"b"}
Arrays and maps
GreyCat provides Array and Map, which are in-memory and meant for small amounts of data:
fn main() {
var arrayA = Array<float>{1.2, 3.4, 5.0, 4.1};
for (k, v in arrayA) {
println("Index: ${k}, value ${v}");
}
// short notation; the drawback is that the element type stays unknown
var arrayB = [1.2, 3.4, 5.0, 4.1];
for (k, v in arrayB) {
println("Index: ${k}, value ${v}");
}
// Index: 0, value 1.2
// Index: 1, value 3.4
// Index: 2, value 5.0
// Index: 3, value 4.1
// Index: 0, value 1.2
// Index: 1, value 3.4
// Index: 2, value 5.0
// Index: 3, value 4.1
}
Similarly:
fn main() {
var map = Map<String, int>{};
map.set("Hello", 5);
map.set("Test", 2);
println(map.get("Test"));
}
Arrays and Maps are useful for small amounts of data. For large datasets, use nodeList and nodeIndex respectively.
Windows
Windows are FIFO (First In First Out) structures with a fixed size. They are used to collect a number of numerical values, and provide handy methods to get statistics on this set of values.
There are two types of windows in GreyCat: TimeWindow, where the size is defined in time (so the number of values can vary), and SlidingWindow, where the number of elements is fixed.
Both are presented next.
TimeWindow
Time windows are convenient for collecting values within a given period of time.
Create a TimeWindow and specify the maximum time separating the first and last value of the set.
The TimeWindow automatically discards old elements once the maximum duration between elements is exceeded.
In the following example, a TimeWindow is used to compute the average of a value over periods of 5 seconds:
fn main() {
var tw = TimeWindow<float> {span: 5s };
for (var t = 0; t < 51; t++) {
// add the value to the time window.
tw.add(time::new(t, DurationUnit::seconds), t as float);
// every five seconds
if (t != 0 && t % 5 == 0) {
// displays structure of TimeWindow:
var t_start = tw.values.get_cell(0, 0);
var t_start_s = t_start.to(DurationUnit::seconds);
var t_end = tw.values.get_cell(tw.size() - 1, 0);
var t_end_s = t_end.to(DurationUnit::seconds);
println("window size: ${tw.size()}, range: [${t_start_s}, ${t_end_s}], ");
// displays the average computed over all values from the last 5 seconds
println("average: ${tw.avg()}");
// display the minimum and maximum elements in the window (here also the first and last, since the values increase)
println("first: ${tw.min()}, last: ${tw.max()}");
}
}
}
// window size: 6, range: [0, 5],
// average: 2.5
// first: Tuple{x:'1970-01-01T00:00:00Z',y:0.0}, last: Tuple{x:'1970-01-01T00:00:05Z',y:5.0}
// window size: 6, range: [5, 10],
// average: 7.5
// first: Tuple{x:'1970-01-01T00:00:05Z',y:5.0}, last: Tuple{x:'1970-01-01T00:00:10Z',y:10.0}
// window size: 6, range: [10, 15],
// average: 12.5
// first: Tuple{x:'1970-01-01T00:00:10Z',y:10.0}, last: Tuple{x:'1970-01-01T00:00:15Z',y:15.0}
// window size: 6, range: [15, 20],
// average: 17.5
// first: Tuple{x:'1970-01-01T00:00:15Z',y:15.0}, last: Tuple{x:'1970-01-01T00:00:20Z',y:20.0}
// window size: 6, range: [20, 25],
// average: 22.5
// first: Tuple{x:'1970-01-01T00:00:20Z',y:20.0}, last: Tuple{x:'1970-01-01T00:00:25Z',y:25.0}
// window size: 6, range: [25, 30],
// average: 27.5
// first: Tuple{x:'1970-01-01T00:00:25Z',y:25.0}, last: Tuple{x:'1970-01-01T00:00:30Z',y:30.0}
// window size: 6, range: [30, 35],
// average: 32.5
// first: Tuple{x:'1970-01-01T00:00:30Z',y:30.0}, last: Tuple{x:'1970-01-01T00:00:35Z',y:35.0}
// window size: 6, range: [35, 40],
// average: 37.5
// first: Tuple{x:'1970-01-01T00:00:35Z',y:35.0}, last: Tuple{x:'1970-01-01T00:00:40Z',y:40.0}
// window size: 6, range: [40, 45],
// average: 42.5
// first: Tuple{x:'1970-01-01T00:00:40Z',y:40.0}, last: Tuple{x:'1970-01-01T00:00:45Z',y:45.0}
// window size: 6, range: [45, 50],
// average: 47.5
// first: Tuple{x:'1970-01-01T00:00:45Z',y:45.0}, last: Tuple{x:'1970-01-01T00:00:50Z',y:50.0}
SlidingWindow
Sliding windows are convenient for collecting a fixed number of values. Create a SlidingWindow and specify the maximum number of values in the window.
The SlidingWindow automatically discards the oldest element when the maximum size is reached.
In the following example, a SlidingWindow is used to compute the average over 5 values:
fn main() {
var sw = SlidingWindow<float>{ span: 5 };
for (var i = 0; i < 51; i++) {
// add the value to the SlidingWindow (as floats)
sw.add(i as float);
// every five values
if (i != 0 && i % 5 == 0) {
// displays the average computed over the last 5 values
println("average over ${sw.size()}: ${sw.avg()}");
// displays the same average again, without the window size
println("average: ${sw.avg()}");
// display the minimum and maximum elements in the window (here also the first and last, since the values increase)
println("first: ${sw.min()}, last: ${sw.max()}");
}
}
}
// average over 5: 3.0
// average: 3.0
// first: 1.0, last: 5.0
// average over 5: 8.0
// average: 8.0
// first: 6.0, last: 10.0
// average over 5: 13.0
// average: 13.0
// first: 11.0, last: 15.0
// average over 5: 18.0
// average: 18.0
// first: 16.0, last: 20.0
// average over 5: 23.0
// average: 23.0
// first: 21.0, last: 25.0
// average over 5: 28.0
// average: 28.0
// first: 26.0, last: 30.0
// average over 5: 33.0
// average: 33.0
// first: 31.0, last: 35.0
// average over 5: 38.0
// average: 38.0
// first: 36.0, last: 40.0
// average over 5: 43.0
// average: 43.0
// first: 41.0, last: 45.0
// average over 5: 48.0
// average: 48.0
// first: 46.0, last: 50.0
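To make the eviction rule easier to see, here is a minimal sketch with a smaller window. It reuses only the calls shown above (add, size, avg, min, max); the span of 3 and the input values are chosen for illustration, and the comments describe what the FIFO rule implies rather than captured program output.

```gcl
fn main() {
    // window limited to 3 values (illustrative size)
    var sw = SlidingWindow<float>{ span: 3 };
    sw.add(1.0);
    sw.add(2.0);
    sw.add(3.0);
    // the window is now full and holds 1.0, 2.0, 3.0
    println("size: ${sw.size()}, average: ${sw.avg()}");
    // adding a fourth value should evict the oldest one (1.0)
    sw.add(4.0);
    // the window should now hold 2.0, 3.0, 4.0
    println("size: ${sw.size()}, average: ${sw.avg()}");
    println("min: ${sw.min()}, max: ${sw.max()}");
}
```

The size should stay at 3 after the fourth add, while the average and the minimum shift upward because 1.0 has left the window.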
Table
Table is a core GreyCat type that serves as a generic two-dimensional container.
It is typically used to return a result set.
For example, web components can handle Table objects returned by the GreyCat backend.
Also, the Explorer can display Table objects.
Sampling results can also be expressed as Tables.
Data elements
Tables are populated one cell or one row at a time. Not all cells need to contain a value; cells that are never set hold null.
fn main() {
var t = Table{}; // creates an empty table
t.init(2,4); // initializes the table with 2 rows and 4 columns
t.set_cell(0, 1, "onetwothree..."); // 1st row, 2nd column
t.set_cell(0, 2, time::now()); // 1st row, 3rd column
info(t.get_cell(0,0)); // this cell was never set
var row = ["...threefive", 0.0, time::now()];
t.set_row(1,row);
info(t.rows()); // 2
t.remove_row(0); // removes row 0
info(t.get_cell(0, 0)); // "...threefive"
}
A Table can be sorted along one column.
t.sort(1,SortOrder::asc); // sorts the rows by column 1, in ascending order
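As a minimal sketch of sorting in context, the following builds a small table and sorts it by its second column. It uses only the calls shown in this section; the data is illustrative, and it assumes that sort reorders whole rows, which is the usual meaning of sorting a table along a column.

```gcl
fn main() {
    var t = Table{};
    t.init(3, 2); // 3 rows, 2 columns
    // column 0: label, column 1: score (illustrative data)
    t.set_cell(0, 0, "c");
    t.set_cell(0, 1, 30);
    t.set_cell(1, 0, "a");
    t.set_cell(1, 1, 10);
    t.set_cell(2, 0, "b");
    t.set_cell(2, 1, 20);
    // reorder the rows by the values in column 1, smallest first
    t.sort(1, SortOrder::asc);
    // assuming rows move as a unit, row 0 is now the row with the score 10
    info(t.get_cell(0, 0));
    info(t.get_cell(0, 1));
}
```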
Applying mappings
A Table can be transformed by applying mappings to its columns.
A mapping is a series of extractors to apply to a specific column of a Table.
type MyObject{
a:String;
b:NestedObject;
}
type NestedObject{
c:String;
}
fn main() {
var t = Table {};
t.init(0, 3);
var mappings = Array<TableColumnMapping>{
TableColumnMapping { column: 0, extractors: Array<any> {"*", "a"} }, // resolve the node, then get the attribute
TableColumnMapping { column: 1, extractors: Array<any> {"a"} }, // resolve the field
TableColumnMapping { column: 1, extractors: Array<any> {"b", "c"} }, // resolve the nested field
TableColumnMapping { column: 2, extractors: Array<any> {0} } // resolve the offset
};
var nestedObj = NestedObject{c: "nested value"};
var obj = MyObject{
a: "attribute a",
b: nestedObj
};
t.set_cell(0, 0, node<MyObject>{obj});
t.set_cell(0, 1, obj);
t.set_cell(0, 2, ["array index 0"]);
var newTable = Table::applyMappings(t, mappings);
info(newTable);
}
The new Table will contain 4 new columns, one per mapping, after the 3 original ones.
[
{
"_type": "core.node",
"ref": "0440000000000000"
},
{
"a": "attribute a",
"b": {
"c": "nested value"
}
},
[
"array index 0"
],
"attribute a", // resolved from the node
"attribute a", // resolved from the object
"nested value", // resolved from the nested object
"array index 0" // resolved from the array
]
Tensor
One powerful feature of GreyCat is its ability to run complex computations on big datasets with a limited amount of RAM. To achieve this, the data is split into small chunks that fit in RAM and are processed one at a time. In machine learning, this is called batch processing. GreyCat re-implements the most useful machine learning algorithms in a streamable, batchable way, so that billions of observations can be processed without a large IT infrastructure.
Since most machine learning algorithms deal with multidimensional numerical data, the most suitable structure to organize such data is a Tensor. You can view a Tensor as a compact multidimensional array.
Creating a tensor
The following code creates and initializes a 2D Tensor with 4 rows and 3 columns. The data in the Tensor is stored as 64-bit floats.
fn main(){
var t = Tensor{};
t.init(TensorType::f64,Array<int> {4, 3}); //Creates a 2 dimensional tensor of 4 rows and 3 columns = 12 elements in total
Assert::equals(t.dim(),2);
Assert::equals(t.size(),12);
println(t);
}
The supported data types for Tensors are: i32 (32-bit integer), i64 (64-bit integer), f32 (32-bit float), f64 (64-bit float), c64 (64-bit complex number: 32 bits for the real part and 32 for the imaginary part) and c128 (128-bit complex number: 64 bits for the real part and 64 for the imaginary part).
Set and get
In this example, we create a 3D Tensor of 5 x 4 x 3 = 60 elements. We set the first element to 42.3, then read the value back to verify it. In the last line, we fill the whole Tensor with 50.3.
fn main(){
var t = Tensor{};
t.init(TensorType::f64,Array<int> {5, 4, 3}); //Creates a 3 dimensional tensor of 5 x 4 x 3 = 60 elements in total
t.set(Array<int> {0, 0, 0}, 42.3);
Assert::equals(t.get(Array<int> {0, 0, 0}),42.3);
t.fill(50.3);
}
To iterate over all the elements of a multidimensional Tensor, use initPos and incPos:
fn main() {
var t = Tensor{};
t.init(TensorType::f64, Array<int> {2, 2, 3}); //Creates a 3 dimensional tensor of 2 x 2 x 3 = 12 elements in total
var random = Random{};
var index = t.initPos(); //init the array to the correct shape of the tensor, in this case [0,0,0]
do {
t.set(index, random.uniformf(-5.0, 5.0));
println(index); // to see how the N dimensional index follows the shape of the tensor
} while (t.incPos(index)); // the incPos will increase the ND array 1 step
}
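The same loop can double as a quick check that the traversal visits every element exactly once. This is a minimal sketch that reuses only the calls shown above, and it relies on incPos returning false once the last position has been passed, as the loop above implies.

```gcl
fn main() {
    var t = Tensor{};
    t.init(TensorType::f64, Array<int> {2, 2, 3}); // 2 x 2 x 3 = 12 elements
    var index = t.initPos(); // starts at [0,0,0]
    var visited = 0;
    do {
        t.set(index, 1.0);
        visited = visited + 1;
    } while (t.incPos(index)); // assumed to return false after the last position
    // the number of visited positions should match the number of elements
    Assert::equals(visited, t.size());
}
```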
Utility methods
A first useful method of Tensor is append. Since data often arrives as a stream, it can be appended to the Tensor as it arrives.
If the Tensor has 1 dimension, append takes as an argument a number, an array of numbers, or a 1D Tensor.
fn main() {
var t = Tensor{};
t.init(TensorType::f64, Array<int> {0}); //Creates a 1 dimensional tensor with 0 elements in it
var t2 = Tensor{};
t2.init(TensorType::f64, Array<int> {3}); //T2 is a 1D tensor of 3 elements filled with 5.0
t2.fill(5.0);
t.append(3.0); //Appends 1 value
t.append([4.0,4.0]); //Appends 2 values coming from an array
t.append(t2); //appends 3 values coming from a 1D tensor
println(t.toTable());
}
For tensors with more than 1 dimension, say N = 4, we can only append a Tensor of N-1 dimensions.
For example:
fn main() {
var t = Tensor{};
t.init(TensorType::f64, Array<int> {0, 2, 3, 4}); //Creates a 4 dimensional tensor of 0 x 2 x 3 x 4 = 0 elements in total, however the tensor has now the mandatory shapes of the last 3 dimensions 2 x 3 x 4
var t2 = Tensor{};
t2.init(TensorType::f64, Array<int> {2, 3, 4});
t2.fill(5.0);
t.append(t2); //appends the t2 to t
t2.fill(9.0);
t.append(t2); //appends the t2 to t
}
Notice that the first dimension of the Tensor can always be initialized to 0; it is the only dimension allowed to change with each append.
To avoid re-allocating the Tensor each time its size increases, use the setCapacity method. It sets the capacity of a Tensor even if the first dimension is 0.
If we add this line after init in the previous example, the Tensor will immediately have the capacity to hold 1000 elements, before any append happens.
t.setCapacity(1000);
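For reference, here is the previous append example with the capacity reserved up front. It is a sketch that combines only the calls already shown (init, setCapacity, fill, append, size); the capacity of 1000 is the value used in the text, not a recommendation.

```gcl
fn main() {
    var t = Tensor{};
    t.init(TensorType::f64, Array<int> {0, 2, 3, 4}); // empty along the first dimension
    t.setCapacity(1000); // reserve room for 1000 elements before appending
    var t2 = Tensor{};
    t2.init(TensorType::f64, Array<int> {2, 3, 4}); // one slice of 2 x 3 x 4 = 24 elements
    t2.fill(5.0);
    t.append(t2); // first slice
    t2.fill(9.0);
    t.append(t2); // second slice
    // two slices of 24 elements each should give 48 elements in total
    println(t.size());
}
```

Both appends stay well below the reserved capacity, so reserving it first is what should spare the Tensor a re-allocation on each append.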
Finally, to re-use the memory of a Tensor with a different shape, call reset and then init with the new shape. As an example:
fn main() {
var t = Tensor{};
t.init(TensorType::f64, Array<int> {2, 3, 2});
t.fill(5.0);
t.reset();
t.init(TensorType::f64, Array<int> {1, 2, 3});
}
This Tensor will reuse the same memory space, with a different shape.
Buffer
Buffer is an efficient string buffer: it lets you build a String incrementally by appending data to it.
fn main() {
var b = Buffer{};
b.add(1);
b.add(" one ");
b.add([1, 2]);
println(b.toString()); // 1 one Array{1,2}
}
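A typical use is to assemble a string inside a loop. The sketch below joins the values of an array with commas, using only the Buffer calls shown above (add, toString) and the array iteration syntax from the Arrays and maps section; the data is illustrative.

```gcl
fn main() {
    var b = Buffer{};
    var values = [1, 2, 3]; // illustrative data
    for (k, v in values) {
        // separate entries with a comma, but not before the first one
        if (k != 0) {
            b.add(",");
        }
        b.add(v);
    }
    println(b.toString());
}
```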