Wednesday, November 20, 2019

Partitioning is probably one of the main reasons tools like Spark and Flink became popular for Big Data: it controls how records are distributed across the parallel instances of an operator, and a custom partitioner lets you decide exactly where each record ends up. Flink's custom partitioning (partitionCustom) can be demonstrated with the following example, which computes the average rating per movie from a ratings.csv file (userId, movieId, rating, timestamp) and then routes each result to a partition based on its average rating:
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class MovieRatingPartitioning {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Keep only movieId and rating out of userId,movieId,rating,timestamp
        // and skip the CSV header line.
        DataSet<Tuple2<Long, Double>> averages = env.readCsvFile("ratings.csv")
            .ignoreFirstLine()
            .includeFields(false, true, true, false)
            .types(Long.class, Double.class)
            .groupBy(0) // group by movieId
            .reduceGroup(new GroupReduceFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
                @Override
                public void reduce(Iterable<Tuple2<Long, Double>> values,
                                   Collector<Tuple2<Long, Double>> out) throws Exception {
                    Long movieId = null;
                    double total = 0;
                    int count = 0;
                    for (Tuple2<Long, Double> value : values) {
                        movieId = value.f0;
                        total += value.f1;
                        count++;
                    }
                    // Only emit averages for movies with more than 50 ratings.
                    if (count > 50) {
                        out.collect(new Tuple2<>(movieId, total / count));
                    }
                }
            });

        // Route each record to a partition derived from its average rating,
        // assuming averages fall in [1.0, 5.0]: an average in [1.0, 2.0)
        // goes to partition 0, [2.0, 3.0) to partition 1, and so on.
        DataSet<Tuple2<Long, Double>> partitioned = averages
            .partitionCustom(new Partitioner<Double>() {
                @Override
                public int partition(Double key, int numPartitions) {
                    return key.intValue() - 1;
                }
            }, 1); // 1 = position of the average-rating field in the tuple

        partitioned.print();
    }
}
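To check where the records actually end up, one option (a minimal sketch, not part of the original example) is to tag each record with the index of the subtask that processed it, using Flink's RichMapFunction and getRuntimeContext().getIndexOfThisSubtask(). Note that the partitioner above needs a parallelism of at least 5, since an average of 5.0 maps to partition index 4:

import org.apache.flink.api.common.functions.RichMapFunction;

// Replace the plain print() above with a map that reports, for each
// record, which parallel subtask (i.e. which partition) received it.
partitioned.map(new RichMapFunction<Tuple2<Long, Double>, String>() {
    @Override
    public String map(Tuple2<Long, Double> value) {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        return "partition " + subtask + " -> movie " + value.f0 + ", avg " + value.f1;
    }
}).print();

Calling env.setParallelism(5) before building the pipeline makes the mapping from rating bucket to subtask index explicit.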
