Hive Notes

Configuring Hive's hive.metastore.uris in Spark

Configure it in Spark's conf/hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://hadoop3:9083</value>
</property>

The value of hive.metastore.uris can be found in Cloudera Manager under Clusters -> Hive -> Instances:
it is the address of the host running the Hive Metastore Server role.

reflect: the function that makes many UDFs unnecessary

An introduction to the reflect function, a newer Hive feature.
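reflect calls a static Java method directly from HiveQL, which often saves writing a one-off UDF. A minimal sketch, assuming a table with a url column to decode (the tracker table is borrowed from the regexp_extract example further below):

select reflect("java.net.URLDecoder", "decode", url, "UTF-8") from tracker;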

Hive collect_set and concat_ws

collect_set is a built-in Hive aggregate function; it returns a collection of objects with duplicates removed, and its return type is array.

select av_seq, concat_ws(',', collect_set(cp_seq)) from dw.smbrandcp group by av_seq;

Getting the length of an array in Hive

Use the size(array) function. Note that split requires a delimiter argument; ',' is assumed below:

select size(split(cook_ids, ',')) from user_cook_recommed;

java.io.IOException: Failed to run job : Application rejected by queue placement policy

This can happen when the hive command is run as the wrong user, for example as root when it should have been run as hdfs.
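For example, you can switch to the hdfs user for a single invocation (a sketch; which user is correct depends on your queue placement rules):

sudo -u hdfs hive -e "select 1;"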

Setting the number of reducers and the MapReduce job name for hive -e

hive -e "set mapred.reduce.tasks=1; $insert_sql "
hive -e "set mapred.job.name=justtest; use eqxdb; select phone,count(1) from base_user group by phone limit 10;"

Creating a Hive table with Sqoop

/bin/sqoop import -D mapred.job.name=sqoop_import_ods_yqc_earn_record \
  --connect jdbc:oracle:thin:@server1:1521:test --username 'root' --password '123' \
  --table EARN_RECORD --hive-import -m 1 \
  --hive-drop-import-delims --hive-table ods.earn_record --hive-overwrite \
  --where ' rownum < 1 ' \
  --columns 'ID,AD_TASK_ID,USER_ID,EARN_XD,CREATE_TIME,PV,UV,SCENE_PV,SCENE_UV,CLICKS,SCENE_ID,AD_POSITION_ID,PUBLISHER_ID' \
  --map-column-hive PV=int,UV=int,SCENE_PV=int,SCENE_UV=int,CLICKS=int,ID=string,AD_TASK_ID=string,EARN_XD=string,SCENE_ID=string,AD_POSITION_ID=string,PUBLISHER_ID=string

Spark Notes

Imports

val spark = builder.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._

The lit function is for adding literal values as a column.

import org.apache.spark.sql.functions._
df.withColumn("D", lit(750))
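withColumn expects a Column as its second argument, so a bare constant like 750 must be wrapped with lit before it can be attached to the DataFrame.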

https://stackoverflow.com/questions/38587609/spark-add-new-column-with-the-same-value-in-scala/38588307

spark-submit parameters

Adding extra jars:
--jars /data/apps/publib/mysql-connector-java-5.1.38.jar

Viewing the logs when a submission fails

yarn logs -applicationId application_1520296293387_121203

Reading Hive table data in Spark

warehouse=/user/hive/warehouse
metastore=thrift://hadoop001.iu.com:9083

// parse the command-line arguments into a map (custom helper)
val argsMap = LangUtil.parseCommandLineArguments(args)
val warehouse = argsMap.get("warehouse")
val metastore = argsMap.get("metastore")

// point the session at the Hive warehouse and metastore, and enable Hive support
val builder = SparkSession.builder()
  .config("spark.sql.warehouse.dir", warehouse)
  .config("hive.metastore.uris", metastore)
  .enableHiveSupport()
val spark = builder.getOrCreate()
import spark.implicits._

spark.sql("select * from d_pv where create_date='2018-06-07' limit 20").show(20, false)

spark.close()

Extracting part of a field with regexp_extract

spark.sql(" select regexp_extract(url,'(http://store.hohode.com/scene-)(.*?)(\\\\.html)',2) product_id , d_i , cats.act act from (select explode(cats) cats , d_i from tracker )  ")

Using agg

import org.apache.spark.sql.functions._
df.filter(" act = '" + btnType + "' ").agg(sum("price")).first().get(0)
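agg over the whole DataFrame returns a single-row DataFrame, so first().get(0) pulls out the summed value; note that it comes back as null when no rows match the filter.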

Naming columns

val lookup = Map("_1" -> "id", "_2" -> "type", "_3" -> "cookie_id")
val rlog = spark.read.textFile(path)
// split each line on '#' and keep three fields; tuple columns are named _1, _2, _3 by default
val rlog1 = rlog.map(_.split("#")).map(x => (getSceneId(x(15)), x(21), x(3)))
// rename the default tuple columns using the lookup map
val rlog2 = rlog1.select(rlog1.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
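For a Dataset of tuples there is a more direct route: toDF accepts the column names positionally. A minimal sketch of the same renaming:

val rlog2 = rlog1.toDF("id", "type", "cookie_id")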

Error 1

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
This is probably caused by a missing implicit import:

import sparkSession.implicits._

https://stackoverflow.com/questions/38664972/why-is-unable-to-find-encoder-for-type-stored-in-a-dataset-when-creating-a-dat

Elasticsearch Notes

A study of the scoring mechanism in Elasticsearch (5.3.0).

Elasticsearch 6.x no longer supports multiple doc_types.


Mapping setup ([date format settings](https://zhuanlan.zhihu.com/p/34240906)):
PUT user/log/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    },
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    },
    "datetime": {
      "type": "date",
      "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
    }
  }
}
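Note that ik_max_word comes from the IK analysis plugin (elasticsearch-analysis-ik), which must be installed separately; it does not ship with Elasticsearch.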

The difference between keyword and text in Elasticsearch (5.4)
keyword: the value is stored and indexed as-is, with no analysis (no tokenization).

text: the value is analyzed (tokenized) at index time and an inverted index is built from the tokens. This is powerful, but pointless for fields that are only ever matched exactly, where text just wastes space.
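For example, an id-like field that is only matched exactly should be keyword, while searchable prose stays text. A sketch in the style of the mapping above (the user_id field here is hypothetical):

PUT user/log/_mapping
{
  "properties": {
    "user_id": { "type": "keyword" },
    "title": { "type": "text" }
  }
}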

Sending messages to DingTalk from the shell command line

#!/bin/bash
#script_name:alert_to_DingDing.sh

subject="$1"
message="$2"

token="faaf1acd861a395bfb58170e17043188734cf0a1edc0ac6034a68ee2c6664de0"
# pick the first non-loopback IPv4 address of this host
ip=`/sbin/ifconfig -a|grep inet|grep -v 127.0.0.1|grep -v inet6|awk '{print $2}'|tr -d "addr:"`
hostname=`hostname`
body="$hostname[$ip]$message"

function sendMessageToDingding(){
    url="https://oapi.dingtalk.com/robot/send?access_token=${token}"
    UA="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"  # defined but not used below

    # post a text message to the robot webhook
    result=`curl -XPOST -s -L -H "Content-Type:application/json" -H "charset:utf-8" $url -d "
{
    \"msgtype\": \"text\",
    \"text\": {
        \"content\": \"$1\n$2\"
    }
}"`
    echo $result
}

# quote the arguments so subjects/messages containing spaces are passed intact
sendMessageToDingding "$subject" "$body"

Switching directories and users inside a Linux script

cd `dirname $0`
pwd
dir=$(pwd)
echo $dir
su - hdfs <<EOF

echo "start inner $(date)"
pwd
cd $dir
pwd
ls
#./sync.py
echo "end inner $(date)"

exit;
EOF

cd `dirname $0` changes into the directory that contains the script.
dir=$(pwd) saves the current directory in dir.

su - hdfs <<EOF
exit;
EOF

switches to the hdfs user and runs the commands between the markers.
cd $dir then changes back into the saved directory.
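Note that because the here-document delimiter EOF is unquoted, $dir and the $(date) calls are expanded by the outer shell before the block is handed to the hdfs user's shell; write <<'EOF' instead if you need the expansion to happen inside the su session.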

Taking web-page screenshots with JS

First, a runnable example (loading the referenced scripts may require a proxy):
example

<!DOCTYPE html>
<html>
<head>
  <meta name="layout" content="main">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
  <script type="text/javascript" src="https://html2canvas.hertzen.com/dist/html2canvas.js"></script>
  <script type="text/javascript" src="http://hongru.github.io/proj/canvas2image/canvas2image.js"></script>

  <script type="text/javascript">
    $(document).ready(function(){
      // render the #capture element onto a canvas and append it to the page
      $(".example1").on("click", function(event) {
        html2canvas(document.querySelector("#capture")).then(canvas => {
          document.body.appendChild(canvas);
        });
      });

      // render the #capture element and save the canvas as a PNG file
      $("#download").on("click", function(event) {
        html2canvas(document.querySelector("#capture")).then(canvas => {
          Canvas2Image.saveAsImage(canvas, canvas.width, canvas.height, "png", "hello");
        });
      });
    });
  </script>
</head>
<body>

<div id="capture" style="padding: 10px; background: #f5da55">
  <h3>This is a really powerful tool</h3>
  <h4 style="color: #000;">Hello world!</h4>
  <h3>This is a really powerful tool</h3>
</div>
<input id="download" type="button" value="Download">
<input class="example1" type="button" value="Screenshot">
</body>
</html>

Because the official Canvas2Image code cannot set the name of the downloaded file, someone opened a PR, but it apparently was never merged into the main branch. So when you need custom file names, use the fork that supports setting the image file name; that is the branch that can set the file name.

The reference pages below are important:

https://segmentfault.com/a/1190000011478657
https://html2canvas.hertzen.com/
Setting the image file name
JS libraries for web-page screenshots and sketching