Wednesday, November 20, 2019

Partitioning is probably one of the main reasons tools like Spark and Flink became popular for Big Data: it controls how records are distributed across the parallel instances of an operator, and a custom partitioner lets you decide exactly where each record ends up. Flink's custom partitioning (partitionCustom) can be demonstrated with the following example, which computes the average rating per movie from a ratings.csv file (userId, movieId, rating, timestamp) and then routes each result to a partition based on its average rating:
import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class MovieRatingPartitioning {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Keep only movieId and rating out of userId,movieId,rating,timestamp
        // and skip the CSV header line.
        DataSet<Tuple2<Long, Double>> averages = env.readCsvFile("ratings.csv")
            .ignoreFirstLine()
            .includeFields(false, true, true, false)
            .types(Long.class, Double.class)
            .groupBy(0) // group by movieId
            .reduceGroup(new GroupReduceFunction<Tuple2<Long, Double>, Tuple2<Long, Double>>() {
                @Override
                public void reduce(Iterable<Tuple2<Long, Double>> values,
                                   Collector<Tuple2<Long, Double>> out) throws Exception {
                    Long movieId = null;
                    double total = 0;
                    int count = 0;
                    for (Tuple2<Long, Double> value : values) {
                        movieId = value.f0;
                        total += value.f1;
                        count++;
                    }
                    // Only emit averages for movies with more than 50 ratings.
                    if (count > 50) {
                        out.collect(new Tuple2<>(movieId, total / count));
                    }
                }
            });

        // Route each record to a partition derived from its average rating,
        // assuming averages fall in [1.0, 5.0]: an average in [1.0, 2.0)
        // goes to partition 0, [2.0, 3.0) to partition 1, and so on.
        DataSet<Tuple2<Long, Double>> partitioned = averages
            .partitionCustom(new Partitioner<Double>() {
                @Override
                public int partition(Double key, int numPartitions) {
                    return key.intValue() - 1;
                }
            }, 1); // 1 = position of the average-rating field in the tuple

        partitioned.print();
    }
}
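To check where the records actually end up, one option (a minimal sketch, not part of the original example) is to tag each record with the index of the subtask that processed it, using Flink's RichMapFunction and getRuntimeContext().getIndexOfThisSubtask(). Note that the partitioner above needs a parallelism of at least 5, since an average of 5.0 maps to partition index 4:

import org.apache.flink.api.common.functions.RichMapFunction;

// Replace the plain print() above with a map that reports, for each
// record, which parallel subtask (i.e. which partition) received it.
partitioned.map(new RichMapFunction<Tuple2<Long, Double>, String>() {
    @Override
    public String map(Tuple2<Long, Double> value) {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        return "partition " + subtask + " -> movie " + value.f0 + ", avg " + value.f1;
    }
}).print();

Calling env.setParallelism(5) before building the pipeline makes the mapping from rating bucket to subtask index explicit.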
