7.0.1685-testing
Csv Analysis feature of GreyCat
GreyCat's CSV analysis is a fast type deduction feature that lets you explore millions of lines of CSV in a matter of seconds.
For example, it analyzes:
8,000,000 lines of CSV (1.3 GB, 15 columns) in 4.8 s,
at an average speed of 1.7M rows/second (277 MB/s),
with date check limits set to unlimited.
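The throughput figures follow directly from the benchmark numbers. The short Python check below reproduces them; it assumes 1 GB is counted as 1024 MB, which is what makes the quoted 277 MB/s come out.

```python
# Throughput derived from the benchmark numbers quoted above.
rows = 8_000_000
size_gb = 1.3
seconds = 4.8

rows_per_second = rows / seconds
# Assumption: 1 GB is counted as 1024 MB.
mb_per_second = size_gb * 1024 / seconds

print(f"{rows_per_second / 1e6:.1f}M rows/s")  # 1.7M rows/s
print(f"{mb_per_second:.0f} MB/s")             # 277 MB/s
```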
Analysis
Dataset
timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true
Result
Analyzing the above CSV would yield:
CsvStatistics {
    header_lines: 1,
    separator: ',',
    string_delimiter: null,
    decimal_separator: null,
    thousands_separator: null,
    columns: Array<CsvColumnStatistics> { /*...*/ },
    line_count: 5,
    fail_count: 0,
    file_count: 1
}
Analysis: column 0
CsvColumnStatistics {
    name: "timestamp",
    example: "2023-01-01 12:00:00",
    null_count: 0,
    bool_count: 0,
    int_count: 0,
    float_count: 0,
    string_count: 0,
    date_count: 5,
    date_format_count: Map<String,int> {
        "%Y-%m-%d %H:%M:%S": 5
    },
    enumerable_count: Map<any,int> {},
    profile: Gaussian {
        sum: null,
        sumsq: null,
        count: null,
        min: null,
        max: null
    }
}
Analysis: column 1
CsvColumnStatistics {
    name: "value",
    example: 100.5,
    null_count: 0,
    bool_count: 0,
    int_count: 0,
    float_count: 5,
    string_count: 0,
    date_count: 0,
    date_format_count: Map<String,int> {},
    enumerable_count: Map<any,int> {},
    profile: Gaussian {
        sum: 526.8,
        sumsq: 55564.02,
        count: 5,
        min: 100.5,
        max: 110.0
    }
}
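The profile fields for a numeric column are running totals, so they can be checked by hand. The Python sketch below recomputes them for the "value" column of the sample dataset. The mean and variance formulas are standard; that GreyCat derives them from the profile in exactly this way is an assumption.

```python
import math

# The "value" column of the sample dataset.
values = [100.5, 105.2, 102.8, 110.0, 108.3]

profile = {
    "sum": sum(values),
    "sumsq": sum(v * v for v in values),
    "count": len(values),
    "min": min(values),
    "max": max(values),
}

# Mean and population variance follow from sum, sumsq and count alone,
# so the raw values do not need to be kept in memory.
mean = profile["sum"] / profile["count"]
variance = profile["sumsq"] / profile["count"] - mean ** 2
std = math.sqrt(variance)

print(profile)
print(f"avg={mean:.2f}")  # avg=105.36
```

Because only five numbers are stored, a profile of this shape can be updated in a single pass over the file, one row at a time.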
Analysis: column 2
CsvColumnStatistics {
    name: "category",
    example: "A",
    null_count: 0,
    bool_count: 0,
    int_count: 0,
    float_count: 0,
    string_count: 5,
    date_count: 0,
    date_format_count: Map<String,int> {},
    enumerable_count: Map<any,int> {
        "C": 1,
        "B": 2,
        "A": 2
    },
    profile: Gaussian {
        sum: null,
        sumsq: null,
        count: null,
        min: null,
        max: null
    }
}
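The enumerable_count map is a frequency count of the distinct values in the column. A minimal Python illustration for the "category" column is shown below. The threshold used to decide whether a column looks like an enum is hypothetical and is not taken from GreyCat.

```python
from collections import Counter

# The "category" column of the sample dataset.
categories = ["A", "B", "A", "C", "B"]

enumerable_count = Counter(categories)

# Hypothetical heuristic: a column with few distinct values is an enum candidate.
MAX_ENUM_VALUES = 16
is_enum_candidate = len(enumerable_count) <= MAX_ENUM_VALUES

print(dict(enumerable_count), is_enum_candidate)
```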
Analysis: column 3
CsvColumnStatistics {
    name: "status",
    example: true,
    null_count: 0,
    bool_count: 5,
    int_count: 0,
    float_count: 0,
    string_count: 0,
    date_count: 0,
    date_format_count: Map<String,int> {},
    enumerable_count: Map<any,int> {},
    profile: Gaussian {
        sum: null,
        sumsq: null,
        count: null,
        min: null,
        max: null
    }
}
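To make the per-column counters concrete, here is a minimal Python sketch of type deduction over the sample dataset. It is an illustration of the idea only, not GreyCat's implementation: the `classify` and `analyze` helpers, the order in which types are tried, and the single hard-coded date format are all assumptions.

```python
import csv
import io
from collections import Counter
from datetime import datetime

SAMPLE = """\
timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true
"""

# Assumption: only this one date format is tried.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def classify(cell):
    """Return the kind of a cell: null, bool, int, float, date or string."""
    if cell == "":
        return "null"
    if cell in ("true", "false"):
        return "bool"
    for kind, parse in (("int", int), ("float", float)):
        try:
            parse(cell)
            return kind
        except ValueError:
            pass
    try:
        datetime.strptime(cell, DATE_FORMAT)
        return "date"
    except ValueError:
        return "string"


def analyze(text):
    """Count the kinds of values seen in each column, keyed by column name."""
    # skipinitialspace handles the ", " separators used in the sample.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    counts = {name: Counter() for name in header}
    for row in reader:
        for name, cell in zip(header, row):
            counts[name][classify(cell)] += 1
    return counts


counts = analyze(SAMPLE)
for name, kinds in counts.items():
    print(name, dict(kinds))
```

Each column of the sample ends up with a single kind counted five times, matching the date_count, float_count, string_count and bool_count values reported above.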
Code generation
From the CsvStatistics, we can generate the GCL types and enums for the reader:
var code = Csv::generate(stats);
println(code);
For the dataset above, this prints:
@volatile
private type Record {
    /// column=0
    @format("%Y-%m-%d %H:%M:%S")
    timestamp: time;
    /// column=1, min=100.5, max=110.0, avg=105.36
    value: float;
    /// column=2
    category: Category;
    /// column=3
    status: bool;
}
@volatile
private enum Category {
    C,
    B,
    A,
}
This feature is also available in the explorer, to ease the import of multiple complex datasets.
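The mapping from column statistics to generated types can be sketched in a few lines of Python. This is a simplified illustration, not how `Csv::generate` is implemented: the `generate` function, the `GCL_TYPES` table, and the input dictionaries are hypothetical, and the min/max/avg comments are omitted.

```python
# Hypothetical mapping from a detected column kind to a GCL type name.
GCL_TYPES = {"date": "time", "float": "float", "int": "int", "bool": "bool"}


def generate(columns):
    """Render a simplified Record type, plus one enum per enumerable column."""
    record = ["@volatile", "private type Record {"]
    enums = []
    for index, col in enumerate(columns):
        record.append(f"    /// column={index}")
        if col["kind"] == "date":
            fmt = col["format"]
            record.append(f'    @format("{fmt}")')
        if col["kind"] == "string" and col.get("values"):
            # A string column with a known set of values becomes an enum.
            type_name = col["name"].capitalize()
            enums.append(
                ["@volatile", f"private enum {type_name} {{"]
                + [f"    {value}," for value in col["values"]]
                + ["}"]
            )
        else:
            type_name = GCL_TYPES.get(col["kind"], "String")
        record.append(f"    {col['name']}: {type_name};")
    record.append("}")
    return "\n".join("\n".join(block) for block in [record] + enums)


# Simplified stand-in for the column statistics of the sample dataset.
columns = [
    {"name": "timestamp", "kind": "date", "format": "%Y-%m-%d %H:%M:%S"},
    {"name": "value", "kind": "float"},
    {"name": "category", "kind": "string", "values": ["C", "B", "A"]},
    {"name": "status", "kind": "bool"},
]

code = generate(columns)
print(code)
```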