Get TopN of All Groups After Group by Using Spark DataFrame

https://stackoverflow.com/questions/33655467/get-topn-of-all-groups-after-group-by-using-spark-dataframe

You can use rank window function as follows

1
2
3
4
5
6
7
8
9
10
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}

val n: Int = ???

// Window definition
val w = Window.partitionBy($"user").orderBy(desc("rating"))

// Filter
df.withColumn("rank", rank.over(w)).where($"rank" <= n)

If you don’t care about ties then you can replace rank with rowNumber

Share