CSV Analysis feature of GreyCat
GreyCat's CSV analysis is a fast type-deduction feature that lets you explore millions of CSV lines in a matter of seconds.
As a reference point, it processes:
8,000,000 lines of CSV (1.3 GB, 15 columns) in 4.8 s,
an average of 1.7M rows/second (277 MB/s), with no limit on date checks.
Analysis
Dataset
timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true
Code
// Files to analyze; the array may hold more than one file.
var files = Array<File> { File { path: "input.csv" } };
// Only the column separator is set explicitly here.
var config = CsvAnalysisConfig { separator: ',' };
var csv_stats = Csv::analyze(files, config);
Result
Analyzing the above CSV would yield:
CsvStatistics {
header_lines: 1,
separator: ',',
string_delimiter: null,
decimal_separator: null,
thousands_separator: null,
columns: Array<CsvColumnStatistics> { /*...*/ },
line_count: 5,
fail_count: 0,
file_count: 1
}
Analysis: column 0
CsvColumnStatistics {
name: "timestamp",
example: "2023-01-01 12:00:00",
null_count: 0,
bool_count: 0,
int_count: 0,
float_count: 0,
string_count: 0,
date_count: 5,
date_format_count: Map<String,int> {
"%Y-%m-%d %H:%M:%S": 5
},
enumerable_count: Map<any,int> {},
profile: Gaussian {
sum: null,
sumsq: null,
count: null,
min: null,
max: null
}
}
Analysis: column 1
CsvColumnStatistics {
name: "value",
example: 100.5,
null_count: 0,
bool_count: 0,
int_count: 0,
float_count: 5,
string_count: 0,
date_count: 0,
date_format_count: Map<String,int> {},
enumerable_count: Map<any,int> {},
profile: Gaussian {
sum: 526.8,
sumsq: 55564.02,
count: 5,
min: 100.5,
max: 110.0
}
}
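The Gaussian profile stores only running totals (sum, sumsq, count, min, max), which is enough to recover the average and the spread without keeping the values. The sketch below is plain Python, not GreyCat code, and uses the numbers from the profile above. It assumes a population standard deviation; whether GreyCat reports the population or the sample variant is not stated here.

```python
import math

# Running totals copied from the "value" column profile above.
profile = {"sum": 526.8, "sumsq": 55564.02, "count": 5}

def mean(p):
    # Average is the running sum divided by the number of values.
    return p["sum"] / p["count"]

def population_std(p):
    # Var = E[x^2] - (E[x])^2, clamped at 0 to absorb rounding error.
    m = mean(p)
    return math.sqrt(max(p["sumsq"] / p["count"] - m * m, 0.0))

avg = mean(profile)
std = population_std(profile)
print(f"avg={avg:.2f} std={std:.2f}")  # prints "avg=105.36 std=3.47"
```

The average matches the avg=105.36 shown in the generated code further down.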
Analysis: column 2
CsvColumnStatistics {
name: "category",
example: "A",
null_count: 0,
bool_count: 0,
int_count: 0,
float_count: 0,
string_count: 5,
date_count: 0,
date_format_count: Map<String,int> {},
enumerable_count: Map<any,int> {
"C": 1,
"B": 2,
"A": 2
},
profile: Gaussian {
sum: null,
sumsq: null,
count: null,
min: null,
max: null
}
}
Analysis: column 3
CsvColumnStatistics {
name: "status",
example: true,
null_count: 0,
bool_count: 5,
int_count: 0,
float_count: 0,
string_count: 0,
date_count: 0,
date_format_count: Map<String,int> {},
enumerable_count: Map<any,int> {},
profile: Gaussian {
sum: null,
sumsq: null,
count: null,
min: null,
max: null
}
}
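The per-column counters above can be reproduced in spirit with a small classifier. The following is a minimal Python sketch, not GreyCat's implementation: the `classify` and `analyze` helpers are hypothetical, only one date format is tried, and the order of the checks (bool, int, float, date, string) is an assumption.

```python
import csv
import io
from collections import Counter
from datetime import datetime

DATA = """timestamp, value, category, status
2023-01-01 12:00:00, 100.5, "A", true
2023-01-02 12:00:00, 105.2, "B", false
2023-01-03 12:00:00, 102.8, "A", true
2023-01-04 12:00:00, 110.0, "C", false
2023-01-05 12:00:00, 108.3, "B", true
"""

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # the only format tried in this sketch

def classify(cell):
    """Return the name of the counter a single cell would increment."""
    if cell == "":
        return "null_count"
    if cell in ("true", "false"):
        return "bool_count"
    try:
        int(cell)
        return "int_count"
    except ValueError:
        pass
    try:
        float(cell)
        return "float_count"
    except ValueError:
        pass
    try:
        datetime.strptime(cell, DATE_FORMAT)
        return "date_count"
    except ValueError:
        return "string_count"

def analyze(text):
    # skipinitialspace drops the blank after each comma so quoted cells parse.
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(rows)
    stats = {name: Counter() for name in header}
    for row in rows:
        for name, cell in zip(header, row):
            stats[name][classify(cell)] += 1
    return stats

stats = analyze(DATA)
for name, counts in stats.items():
    print(name, dict(counts))
```

Each column of the sample dataset ends up with a single non-zero counter of 5, which mirrors the date_count, float_count, string_count and bool_count values shown above.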
Code generation
From the CsvStatistics we can generate the GCL types and enums for the CSV reader:
var code = Csv::generate(csv_stats);
println(code);
This prints:
@volatile
private type Record {
/// column=0
@format("%Y-%m-%d %H:%M:%S")
timestamp: time;
/// column=1, min=100.5, max=110.0, avg=105.36
value: float;
/// column=2
category: Category;
/// column=3
status: bool;
}
@volatile
private enum Category {
C,
B,
A,
}
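To make the mapping from statistics to generated code more concrete, here is a simplified Python sketch of the idea. It is not how Csv::generate is implemented: the `dominant_type` and `generate` helpers and the selection rule (use an enum when enumerable values exist, otherwise the type with the highest count) are assumptions, and annotations such as @volatile and @format are left out.

```python
def dominant_type(col):
    """Pick a field type from the per-type counters (hypothetical rule)."""
    if col.get("enumerable_count"):
        return col["name"].capitalize()  # enum named after the column
    kinds = {
        "time": "date_count",
        "float": "float_count",
        "int": "int_count",
        "bool": "bool_count",
        "String": "string_count",
    }
    # Choose the type whose counter is largest for this column.
    return max(kinds, key=lambda k: col.get(kinds[k], 0))

def generate(columns):
    lines = ["type Record {"]
    enums = []
    for i, col in enumerate(columns):
        lines.append(f"/// column={i}")
        lines.append(f"{col['name']}: {dominant_type(col)};")
        if col.get("enumerable_count"):
            enums.append((col["name"].capitalize(), list(col["enumerable_count"])))
    lines.append("}")
    # One enum per enumerable column, values in the order they were recorded.
    for name, values in enums:
        lines.append(f"enum {name} {{")
        lines += [f"{v}," for v in values]
        lines.append("}")
    return "\n".join(lines)

# Counters taken from the column analyses above (zero counters omitted).
columns = [
    {"name": "timestamp", "date_count": 5},
    {"name": "value", "float_count": 5},
    {"name": "category", "string_count": 5,
     "enumerable_count": {"C": 1, "B": 2, "A": 2}},
    {"name": "status", "bool_count": 5},
]

code = generate(columns)
print(code)
```

The output has the same shape as the generated Record type and Category enum, minus the annotations and the min/max/avg comments.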
This feature is also available in the explorer, where it eases the import of multiple complex datasets.
