Spark array_contains: Checking Whether a JSON Array Contains an Element

Outside of spark-shell, the functions package has to be imported as follows:

import org.apache.spark.sql.functions._

For data with the following schema:

scala> res5.printSchema
root
 |-- cats: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- act: string (nullable = true)
 |    |    |-- cat: string (nullable = true)
 |-- u_i: string (nullable = true)

First, convert the nested JSON data into an array of a single field.
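One possible way to do this conversion is sketched below. It is not necessarily the approach used in the full post. It assumes Spark 2.2 or later (for `SparkSession` and `spark.read.json` on a `Dataset[String]`), and the sample records, the `cat_names` column, and the `withCategory` helper are all hypothetical names introduced for illustration. Selecting `cats.cat` reaches through the array of structs and yields an `array<string>`, which `array_contains` can then test.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, col}

object ArrayContainsSketch {
  // Keeps the rows whose cats array has an element with the given cat value.
  def withCategory(df: DataFrame, category: String): DataFrame = {
    // cats.cat turns array<struct<act, cat>> into array<string>.
    val flattened = df.withColumn("cat_names", col("cats.cat"))
    // array_contains is true when the array holds the given value.
    flattened.filter(array_contains(col("cat_names"), category))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("array-contains-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample records shaped like the schema above.
    val json = Seq(
      """{"u_i":"u1","cats":[{"act":"view","cat":"books"},{"act":"buy","cat":"music"}]}""",
      """{"u_i":"u2","cats":[{"act":"view","cat":"games"}]}"""
    ).toDS()
    val df = spark.read.json(json)

    // Only the u1 row has a cats element whose cat is "books".
    withCategory(df, "books").select("u_i").show()

    spark.stop()
  }
}
```

Keeping the flattened array in its own column makes the intermediate step visible; the same filter could also be written directly as `array_contains(col("cats.cat"), category)`.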


How to Become an Interesting Person


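Last night my friend A-He invited me to a dinner with her colleagues. She had meant to bring me along to liven things up, but it ended up having the opposite effect.

Throughout the dinner I kept my head down and ate by myself, with little interest in talking to anyone. Now and then someone asked me a question, and I gave only the briefest of answers.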


Reading a Local Configuration File in Spark

In 1.x versions such as 1.6, reading a local configuration file requires first adding the following option to the submit script:

--files /data/apps/config.properties

The file can then be read in the Spark application's main method as follows:

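import java.io.FileInputStream
import java.util.Properties

val filePath = "config.properties" // file name only; the file itself is shipped with --files
LogUtil.info(filePath)             // LogUtil is the author's own logging helper
val props = new Properties()
props.load(new FileInputStream(filePath))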
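The snippet above never closes the input stream. A slightly more defensive variant might look like the following sketch. `ConfigLoader`, `load`, and the `app.name` key are hypothetical names used only for illustration, and the sketch assumes the file is reachable under the given path at runtime.

```scala
import java.io.FileInputStream
import java.util.Properties

object ConfigLoader {
  // Loads a .properties file and always closes the stream, even if load throws.
  def load(path: String): Properties = {
    val props = new Properties()
    val in = new FileInputStream(path)
    try props.load(in)
    finally in.close()
    props
  }

  def main(args: Array[String]): Unit = {
    // "config.properties" is the name of the file passed with --files.
    val props = load("config.properties")
    // "app.name" is a hypothetical key; "unknown" is returned if it is absent.
    println(props.getProperty("app.name", "unknown"))
  }
}
```

Wrapping the load in a small function keeps the stream handling in one place, so the main method only deals with the resulting `Properties` object.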

For the complete code, see


Creating a Scala Project in IntelliJ

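  1. Click Create New Project, select a Maven project, set Project SDK to 1.8, and click Next.
  2. Set GroupId: hohode, ArtifactId: scala_test, Version: 1, and click Next.
  3. Click Finish.
  4. In the pop-up at the bottom right, click Enable Auto-Import.
  5. Right-click the project name scala_test and choose Add Framework Support…
  6. Select Scala, set Use library to scala-sdk-2.11.8, and click OK.
  7. Create a new scala directory under main.
  8. Right-click the new scala directory and choose Mark Directory as -> Sources Root.
  9. Right-click the scala directory and choose New -> Scala Class, with name: FirstScala and kind: Object.
  10. Enter the following code:
    object FirstScala {
      def main(args: Array[String]): Unit = {
        val v = 1 + 2
        print(v)
      }
    }
  11. Run it to test.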

Reference: "IntelliJ IDEA创建Maven项目–Scala" (Creating a Maven Project in IntelliJ IDEA: Scala)

Why the Elasticsearch scroll_id Never Changes

Why the scroll_id does not change
Short answer: yes, if you have a single shard index (as seems to be in your case) - it is expected behavior, but it can happen even if you have multiple shards. Longer answer: the scroll basically contains a list of shards where your search is running plus information about how to find your scroll data on each shard. As you exhaust results from each shard, you will notice that the scroll id becomes shorter, because we no longer need to search these shards and therefore don’t need to list them on scroll. But if you only have one shard, or all shards get processed at the same time, your scroll id might never change. That said, I wouldn’t rely on this behavior, since it might change in the future; always copy the scroll id from the previous response.
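The advice to always copy the scroll id from the previous response can be sketched as a simple loop. The `ScrollClient` trait and `Page` case class below are hypothetical stand-ins rather than part of any Elasticsearch client library; a real client call would sit behind `startScroll` and `continueScroll`.

```scala
object ScrollSketch {
  // One page of results plus the scroll id returned with it.
  final case class Page(scrollId: String, hits: Seq[String])

  // Hypothetical minimal client interface, for illustration only.
  trait ScrollClient {
    def startScroll(): Page
    def continueScroll(scrollId: String): Page
  }

  // Drains a scroll, always passing along the id from the most recent response.
  def readAll(client: ScrollClient): Seq[String] = {
    val out = Seq.newBuilder[String]
    var page = client.startScroll()
    while (page.hits.nonEmpty) {
      out ++= page.hits
      // Do not cache the first id: it may or may not change between responses.
      page = client.continueScroll(page.scrollId)
    }
    out.result()
  }
}
```

Because the loop reads `page.scrollId` on every iteration, it behaves the same whether the id stays constant (as with a single shard) or changes as shards are exhausted.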