hive&mysql正则过滤乱码

相关链接


目录


1.前言

  • 需求:数仓使用的hive,上游业务库之一是mysql数据库。数据接入(使用自研平台)时发现字段中文名有乱码,要求在接入前先检查所有字段中文为乱码的表,整理出来,要求上游整改完毕后再接入。
  • 思路:使用正则表达式
    • mysql库使用正则过滤乱码字段
    • 元数据接入hive后,使用hive正则过滤乱码字段

2.mysql


  • mysql 建表 + 造数据
drop table account;
CREATE TABLE IF NOT EXISTS account (
uid int(11) DEFAULT NULL COMMENT '主键`',
uname varchar(10) DEFAULT NULL COMMENT '层级',
money int(11) DEFAULT NULL COMMENT '金额'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO account (uid, uname, money) VALUES
(1, 'asA123', 123),
(2, '中文字符', 123),
(3, 'abc', 123),
(4, 'ABC', 123),
(5, '12345', 123),
(6, 'ç»„ç»‡ç»“æž„ç‰ˆæœ¬ç¼–å ·   ', 123),
(7, '层级 ', 123);

2.1 取出所有乱码数据


  • 这里要取出来uname这一列中存在乱码的数据,其中id=6和id=7的存在乱码情况
  • 其中HEX是为了返回十六进制值的字符串表示形式,mysql不支持正则\u4e00-\u9fa5来过滤中文,需要转成十六进制后再通过正则匹配
-- 使用正则表达式
select 
	* 
from 
	account t1
where 
	1=1 
	and hex(t1.uname) not REGEXP 'e[4-9][0-9a-f]{4}' 
	and t1.uname not REGEXP '[a-z]|[0-9]|[-]' 
	and t1.uname <> ''
  • 查询结果

hive&mysql正则过滤乱码


2.2 取出库中所有乱码的字段中文注释


SELECT
	'MySQL' AS "DB_TYPE",
	T2.TABLE_SCHEMA AS "库名",
	T2.TABLE_NAME AS "表名",
	T2.TABLE_COMMENT AS "表中文名",
	T1.ORDINAL_POSITION AS "序号",
	T1.COLUMN_NAME AS "字段名",
	T1.COLUMN_COMMENT AS "字段中文",
	T1.COLUMN_TYPE AS SOURCE_COMBINE,
	CASE 
		WHEN T1.IS_NULLABLE = 'YES' 
			THEN 'N'
		WHEN T1.IS_NULLABLE = 'NO' 
			THEN 'Y'	
		ELSE 
			NULL
	END AS "非空",
	CASE
		WHEN T1.COLUMN_KEY='PRI' 
			THEN 'Y'
		ELSE 
			NULL
	END AS "主键", 
	CASE 
		WHEN T2.TABLE_TYPE = 'base table' 
			THEN "表"
		WHEN T2.TABLE_TYPE = 'view' 
			THEN "视图"
		WHEN T2.TABLE_TYPE = 'system view' 
			THEN "MySQL系统表"
		ELSE NULL
	END AS "表/视图",
	T2.TABLE_ROWS AS "数据量",
	T1.DATA_TYPE AS "DATA_TYPE",
	T1.CHARACTER_OCTET_LENGTH AS "字节数",
	T1.CHARACTER_MAXIMUM_LENGTH AS  "长度",
	T1.NUMERIC_PRECISION AS "精度",
	T1.NUMERIC_SCALE AS "标度"
FROM 
	information_Schema.`COLUMNS` T1
LEFT JOIN 
	information_schema.TABLES T2 ON T1.TABLE_NAME = T2.TABLE_NAME AND T1.TABLE_SCHEMA = T2.TABLE_SCHEMA
WHERE 
	1=1 
	and hex(t1.COLUMN_COMMENT) not REGEXP 'e[4-9][0-9a-f]{4}' 
	and t1.COLUMN_COMMENT not REGEXP '[a-z]|[0-9]|[-]' 
	and t1.COLUMN_COMMENT <> ''
ORDER BY
	T2.TABLE_SCHEMA,	-- 库名
	T2.TABLE_NAME,		-- 表名
	T1.ORDINAL_POSITION	-- 字段序号

  • 查询结果

hive&mysql正则过滤乱码


3.hive


  • t1,t2,t3,t4 存放的是接入服务的元数据信息,接入服务都是通过此服务的元数据信息进行建表的
select 
   t4.`host` as `主机`,
   t4.`port` as `端口`,
   t3.database_name as `库名`,
   t2.table_name as `表名`,
   t1.column_name as `字段名`,
   t1.column_comment as `字段中文`
from 
	stg.stg_databus_t_columns t1
	left join stg.stg_t_tables t2 on t1.table_id = t2.id
	left join stg.stg_t_databases t3 on t2.database_id = t3.id
	left join stg.stg_t_source_database_info t4 on t3.source_database_id = t4.id
where 
 	t1.pt='20211031000000' 
	and t1.column_comment not REGEXP '[\\s\\w\\d\u4e00-\u9fa5]|[-]' 
	and t1.column_comment <> ''
  • 查询结果(hive只能在平台上查询)

hive&mysql正则过滤乱码


21/11/01

M

上一篇:MySQL事务隔离级别实战演示


下一篇:Backup App's data without rooting the phone