Comparing whole rows at once with TO_JSON_STRING
Overview
- After changing a query that runs in a batch, compare entire rows between the before and after tables to find rows that changed.
- Use multiple columns together as a single JOIN key.
- JOIN easily even on ARRAY and STRUCT columns.
All of these came up recently. Done naively, you need to:
- write out every target column
- work around NULLs with NULLIF or similar, because a comparison involving NULL returns NULL, so a NULL -> non-NULL change (or the reverse) would go undetected
In situations like these, TO_JSON_STRING helps.
Keywords
TO_JSON_STRING
Query
The subqueries race1_unnest and race2_unnest are the comparison targets.
Rows of race1_unnest that do not exist in race2_unnest are output via a LEFT JOIN.
race2_unnest is generated by modifying some values of race1_unnest (see the comments in the query).
WITH race1 AS (
  SELECT
    "800M" AS race,
    [STRUCT("Rudisha" as name, "aiueo" as splits),
     STRUCT("Makhloufi" as name, "aiueo" as splits),
     STRUCT("Murphy" as name, NULL as splits),
     STRUCT("Bosse" as name, "aiueo" as splits),
     STRUCT("Rotich" as name, "aiueo" as splits),
     STRUCT("Lewandowski" as name, "aiueo" as splits),
     STRUCT("Kipketer" as name, "aiueo" as splits),
     STRUCT("Berian" as name, "aiueo" as splits),
     STRUCT("Nathan" as name, "aiueo" as splits),
     STRUCT("David" as name, NULL as splits)] AS participants
),
race1_unnest AS (
  SELECT race, p.* FROM race1, race1.participants AS p
),
race2 AS (
  SELECT
    "800M" AS race,
    [STRUCT("Rudisha" as name, "aiueo" as splits),
     STRUCT("Makhloufi" as name, NULL as splits),       # changed: non-NULL -> NULL
     STRUCT("Murphy" as name, NULL as splits),          # unchanged
     STRUCT("Bosse" as name, "aiueo" as splits),
     STRUCT("Rotich" as name, "aiueo" as splits),
     STRUCT("Lewandowski" as name, "aiueo" as splits),
     STRUCT("Kipketer" as name, "aiueo" as splits),
     STRUCT("Berian" as name, "aiueo" as splits),
     STRUCT("Nathan" as name, "kakikukeko" as splits),  # changed
     STRUCT("David" as name, "yey" as splits)           # changed: NULL -> non-NULL
    ] AS participants
),
race2_unnest AS (
  SELECT race, p.* FROM race2, race2.participants AS p
)
SELECT t1, t2
FROM race1_unnest AS t1
LEFT JOIN race2_unnest AS t2
  ON TO_JSON_STRING(t1) = TO_JSON_STRING(t2)
WHERE t2.race IS NULL
Output
Explanation
Instead of listing the columns one by one in the JOIN, writing TO_JSON_STRING(t1) = TO_JSON_STRING(t2)
compares whole rows at once.
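The same idea can be sketched in Python with hypothetical row dicts: serializing each row to one canonical string makes whole-row comparison, NULL (None) fields included, a single string equality.

```python
import json

def row_key(row):
    # Serialize the whole row to one canonical string, like TO_JSON_STRING.
    # sort_keys keeps the key stable regardless of dict insertion order.
    return json.dumps(row, sort_keys=True)

race1 = [{"race": "800M", "name": "Murphy", "splits": None},
         {"race": "800M", "name": "Nathan", "splits": "aiueo"}]
race2 = [{"race": "800M", "name": "Murphy", "splits": None},
         {"race": "800M", "name": "Nathan", "splits": "kakikukeko"}]

# Anti-join: keep race1 rows whose serialized form is absent from race2.
race2_keys = {row_key(r) for r in race2}
changed = [r for r in race1 if row_key(r) not in race2_keys]
# Only the Nathan row differs; Murphy's all-None row matches and is excluded.
```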
Comparison with the naive version
Output
Since NULL = NULL evaluates to NULL (neither TRUE nor FALSE), rows whose compared columns contain NULL are output as above even when nothing changed.
The query is below.
WITH race1 AS (
  SELECT
    "800M" AS race,
    [STRUCT("Rudisha" as name, "aiueo" as splits),
     STRUCT("Makhloufi" as name, "aiueo" as splits),
     STRUCT("Murphy" as name, NULL as splits),          # unchanged -- this is the row in question
     STRUCT("Bosse" as name, "aiueo" as splits),
     STRUCT("Rotich" as name, "aiueo" as splits),
     STRUCT("Lewandowski" as name, "aiueo" as splits),
     STRUCT("Kipketer" as name, "aiueo" as splits),
     STRUCT("Berian" as name, "aiueo" as splits),
     STRUCT("Nathan" as name, "aiueo" as splits),
     STRUCT("David" as name, NULL as splits)] AS participants
),
race1_unnest AS (
  SELECT race, p.* FROM race1, race1.participants AS p
),
race2 AS (
  SELECT
    "800M" AS race,
    [STRUCT("Rudisha" as name, "aiueo" as splits),
     STRUCT("Makhloufi" as name, NULL as splits),       # changed: non-NULL -> NULL
     STRUCT("Murphy" as name, NULL as splits),          # unchanged -- this is the row in question
     STRUCT("Bosse" as name, "aiueo" as splits),
     STRUCT("Rotich" as name, "aiueo" as splits),
     STRUCT("Lewandowski" as name, "aiueo" as splits),
     STRUCT("Kipketer" as name, "aiueo" as splits),
     STRUCT("Berian" as name, "aiueo" as splits),
     STRUCT("Nathan" as name, "kakikukeko" as splits),  # changed
     STRUCT("David" as name, "yey" as splits)           # changed: NULL -> non-NULL
    ] AS participants
),
race2_unnest AS (
  SELECT race, p.* FROM race2, race2.participants AS p
)
SELECT t1, t2
FROM race1_unnest AS t1
LEFT JOIN race2_unnest AS t2
  ON t1.race = t2.race AND t1.name = t2.name AND t1.splits = t2.splits
WHERE t2.race IS NULL
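The NULL behavior itself can be sketched in Python, emulating SQL's three-valued logic with None (an illustration, not BigQuery's implementation):

```python
def sql_eq(a, b):
    # SQL three-valued logic: comparing with NULL yields NULL (None here),
    # which a JOIN predicate treats the same as FALSE.
    if a is None or b is None:
        return None
    return a == b

# Murphy's splits are NULL on both sides, yet the naive predicate
# splits = splits evaluates to NULL, so the row fails the join anyway.
print(sql_eq(None, None))        # None, not True
print(sql_eq("aiueo", "aiueo"))  # True
```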
Note
It also works with ARRAY and STRUCT columns.
WITH race1 AS (
  SELECT
    "800M" AS race,
    [STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
     STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
     STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
     STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
     STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
     STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
     STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
     STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits),
     STRUCT("Nathan" as name, ARRAY<FLOAT64>[] as splits),
     STRUCT("David" as name, NULL as splits)] AS participants
),
race1_unnest AS (
  SELECT race, p.* FROM race1, race1.participants AS p
),
race2 AS (
  SELECT
    "800M" AS race,
    [STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
     STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
     STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
     STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
     STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
     STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
     STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
     STRUCT("Berian" as name, [23.7, 26.1, 27.0, 293.0] as splits),  # this value changed
     STRUCT("Nathan" as name, ARRAY<FLOAT64>[] as splits),
     STRUCT("David" as name, NULL as splits)] AS participants
),
race2_unnest AS (
  SELECT race, p.* FROM race2, race2.participants AS p
)
SELECT t1, t2
FROM race1_unnest AS t1
LEFT JOIN race2_unnest AS t2
  ON TO_JSON_STRING(t1) = TO_JSON_STRING(t2)
WHERE t2.race IS NULL
Writing GROUP BY and a window function together
Overview
When you want a grouped aggregate and a window aggregate at the same time, I suspect many people first compute the aggregate in a subquery. You can do both in one query.
As a sample, this post uses the Google Analytics sample data to output daily and monthly transaction totals at once.
Output image
See the explanation section.
Keywords
- window functions
Query
SELECT
  _TABLE_SUFFIX AS _date,
  SUBSTR(_TABLE_SUFFIX, 1, 6) AS _month,
  SUM(totals.transactions) AS daily_transactions,
  SUM(SUM(totals.transactions)) OVER (PARTITION BY SUBSTR(_TABLE_SUFFIX, 1, 6)) AS monthly_transactions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX >= "20170501"
GROUP BY 1, 2
ORDER BY 1, 2
Explanation
Write the aggregate function (SUM) inside the window function (SUM); the window runs over the grouped result.
Checking the data
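As a rough Python illustration with made-up session rows, the query does two passes: a GROUP BY pass for daily totals, then a window pass over those grouped rows.

```python
from collections import defaultdict

# Hypothetical per-session rows: (date "YYYYMMDD", transactions).
sessions = [("20170501", 78), ("20170502", 74), ("20170601", 35)]

# Step 1: the GROUP BY -- daily totals.
daily = defaultdict(int)
for date, tx in sessions:
    daily[date] += tx

# Step 2: the window function runs over the grouped rows, so
# SUM(SUM(tx)) OVER (PARTITION BY month) is a sum of daily sums.
monthly = defaultdict(int)
for date, total in daily.items():
    monthly[date[:6]] += total

# Final rows: (_date, _month, daily_transactions, monthly_transactions).
rows = [(d, d[:6], t, monthly[d[:6]]) for d, t in sorted(daily.items())]
```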
_date | _month | daily_transactions | monthly_transactions |
---|---|---|---|
20170501 | 201705 | 78 | 1160 |
20170502 | 201705 | 74 | 1160 |
20170503 | 201705 | 70 | 1160 |
20170504 | 201705 | 49 | 1160 |
20170505 | 201705 | 36 | 1160 |
20170506 | 201705 | 12 | 1160 |
20170507 | 201705 | 11 | 1160 |
20170508 | 201705 | 53 | 1160 |
20170509 | 201705 | 47 | 1160 |
20170510 | 201705 | 40 | 1160 |
20170511 | 201705 | 46 | 1160 |
20170512 | 201705 | 54 | 1160 |
20170513 | 201705 | 23 | 1160 |
20170514 | 201705 | 10 | 1160 |
20170515 | 201705 | 37 | 1160 |
20170516 | 201705 | 53 | 1160 |
20170517 | 201705 | 41 | 1160 |
20170518 | 201705 | 35 | 1160 |
20170519 | 201705 | 41 | 1160 |
20170520 | 201705 | 30 | 1160 |
20170521 | 201705 | 22 | 1160 |
20170522 | 201705 | 25 | 1160 |
20170523 | 201705 | 33 | 1160 |
20170524 | 201705 | 36 | 1160 |
20170525 | 201705 | 35 | 1160 |
20170526 | 201705 | 31 | 1160 |
20170527 | 201705 | 16 | 1160 |
20170528 | 201705 | 9 | 1160 |
20170529 | 201705 | 18 | 1160 |
20170530 | 201705 | 41 | 1160 |
20170531 | 201705 | 54 | 1160 |
20170601 | 201706 | 35 | 971 |
20170602 | 201706 | 37 | 971 |
20170603 | 201706 | 15 | 971 |
20170604 | 201706 | 15 | 971 |
20170605 | 201706 | 35 | 971 |
20170606 | 201706 | 29 | 971 |
20170607 | 201706 | 38 | 971 |
20170608 | 201706 | 26 | 971 |
20170609 | 201706 | 36 | 971 |
20170610 | 201706 | 13 | 971 |
20170611 | 201706 | 29 | 971 |
20170612 | 201706 | 61 | 971 |
20170613 | 201706 | 48 | 971 |
20170614 | 201706 | 46 | 971 |
20170615 | 201706 | 36 | 971 |
20170616 | 201706 | 36 | 971 |
20170617 | 201706 | 18 | 971 |
20170618 | 201706 | 28 | 971 |
20170619 | 201706 | 41 | 971 |
20170620 | 201706 | 36 | 971 |
20170621 | 201706 | 37 | 971 |
20170622 | 201706 | 36 | 971 |
20170623 | 201706 | 29 | 971 |
20170624 | 201706 | 10 | 971 |
20170625 | 201706 | 19 | 971 |
20170626 | 201706 | 32 | 971 |
20170627 | 201706 | 40 | 971 |
20170628 | 201706 | 31 | 971 |
20170629 | 201706 | 45 | 971 |
20170630 | 201706 | 34 | 971 |
20170701 | 201707 | 3 | 1072 |
20170702 | 201707 | 8 | 1072 |
20170703 | 201707 | 15 | 1072 |
20170704 | 201707 | 7 | 1072 |
20170705 | 201707 | 42 | 1072 |
20170706 | 201707 | 31 | 1072 |
20170707 | 201707 | 40 | 1072 |
20170708 | 201707 | 14 | 1072 |
20170709 | 201707 | 19 | 1072 |
20170710 | 201707 | 47 | 1072 |
20170711 | 201707 | 42 | 1072 |
20170712 | 201707 | 49 | 1072 |
20170713 | 201707 | 65 | 1072 |
20170714 | 201707 | 47 | 1072 |
20170715 | 201707 | 16 | 1072 |
20170716 | 201707 | 28 | 1072 |
20170717 | 201707 | 53 | 1072 |
20170718 | 201707 | 51 | 1072 |
20170719 | 201707 | 57 | 1072 |
20170720 | 201707 | 41 | 1072 |
20170721 | 201707 | 42 | 1072 |
20170722 | 201707 | 18 | 1072 |
20170723 | 201707 | 16 | 1072 |
20170724 | 201707 | 40 | 1072 |
20170725 | 201707 | 38 | 1072 |
20170726 | 201707 | 42 | 1072 |
20170727 | 201707 | 52 | 1072 |
20170728 | 201707 | 46 | 1072 |
20170729 | 201707 | 19 | 1072 |
20170730 | 201707 | 22 | 1072 |
20170731 | 201707 | 62 | 1072 |
20170801 | 201708 | 45 | 45 |
Monthly aggregation
SELECT
  SUBSTR(_TABLE_SUFFIX, 1, 6) AS _month,
  SUM(totals.transactions) AS monthly_transactions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX >= "20170501"
GROUP BY 1
ORDER BY 1
_month | monthly_transactions |
---|---|
201705 | 1160 |
201706 | 971 |
201707 | 1072 |
201708 | 45 |
That's all.
Converting wide data to long data in BigQuery (with multi-cursor editing)
Aside
A small post. Next I'm planning to look into BigQuery ML and write about it.
I'll come back to anomaly detection after studying a bit more.
Overview
Convert wide-format data to long format.
There are probably various benefits to long format; for example, when choosing features for BigQuery ML, it might make it easier to compute correlations between all columns.
Keywords
- brute force
- multi-cursor editing
Output image
Before
id | col1 | col2 | .... | col100 |
---|---|---|---|---|
1 | 2 | 4 | .... | 15 |
2 | 3 | 1 | .... | 2 |
3 | 5 | 0 | .... | 3 |
4 | 1 | 1 | .... | 1 |
After
id | col_name | value |
---|---|---|
1 | col1 | 2 |
1 | col2 | 4 |
1 | col3 | 0 |
... | ... | ... |
1 | col100 | 15 |
2 | col1 | 3 |
2 | col2 | 1 |
... | ... | ... |
2 | col100 | 2 |
...followed by col1 through col100 for ids 3 and 4
Once the data is in long format like this, you can JOIN on id and compute correlations between the columns.
Query
with sample AS (  # generate 100,000 rows x 100 columns of data
  SELECT
    SUM(1) OVER(ORDER BY RAND()) AS id,
    RAND() AS col1, RAND() AS col2, RAND() AS col3, RAND() AS col4, RAND() AS col5, RAND() AS col6, RAND() AS col7, RAND() AS col8, RAND() AS col9, RAND() AS col10,
    RAND() AS col11, RAND() AS col12, RAND() AS col13, RAND() AS col14, RAND() AS col15, RAND() AS col16, RAND() AS col17, RAND() AS col18, RAND() AS col19, RAND() AS col20,
    RAND() AS col21, RAND() AS col22, RAND() AS col23, RAND() AS col24, RAND() AS col25, RAND() AS col26, RAND() AS col27, RAND() AS col28, RAND() AS col29, RAND() AS col30,
    RAND() AS col31, RAND() AS col32, RAND() AS col33, RAND() AS col34, RAND() AS col35, RAND() AS col36, RAND() AS col37, RAND() AS col38, RAND() AS col39, RAND() AS col40,
    RAND() AS col41, RAND() AS col42, RAND() AS col43, RAND() AS col44, RAND() AS col45, RAND() AS col46, RAND() AS col47, RAND() AS col48, RAND() AS col49, RAND() AS col50,
    RAND() AS col51, RAND() AS col52, RAND() AS col53, RAND() AS col54, RAND() AS col55, RAND() AS col56, RAND() AS col57, RAND() AS col58, RAND() AS col59, RAND() AS col60,
    RAND() AS col61, RAND() AS col62, RAND() AS col63, RAND() AS col64, RAND() AS col65, RAND() AS col66, RAND() AS col67, RAND() AS col68, RAND() AS col69, RAND() AS col70,
    RAND() AS col71, RAND() AS col72, RAND() AS col73, RAND() AS col74, RAND() AS col75, RAND() AS col76, RAND() AS col77, RAND() AS col78, RAND() AS col79, RAND() AS col80,
    RAND() AS col81, RAND() AS col82, RAND() AS col83, RAND() AS col84, RAND() AS col85, RAND() AS col86, RAND() AS col87, RAND() AS col88, RAND() AS col89, RAND() AS col90,
    RAND() AS col91, RAND() AS col92, RAND() AS col93, RAND() AS col94, RAND() AS col95, RAND() AS col96, RAND() AS col97, RAND() AS col98, RAND() AS col99, RAND() AS col100
  FROM UNNEST(GENERATE_ARRAY(0,100000))
),
melt_arr AS (
  SELECT
    id,
    [
      STRUCT(col1 AS value,"col1" AS col_name), STRUCT(col2 AS value,"col2" AS col_name), STRUCT(col3 AS value,"col3" AS col_name), STRUCT(col4 AS value,"col4" AS col_name), STRUCT(col5 AS value,"col5" AS col_name),
      STRUCT(col6 AS value,"col6" AS col_name), STRUCT(col7 AS value,"col7" AS col_name), STRUCT(col8 AS value,"col8" AS col_name), STRUCT(col9 AS value,"col9" AS col_name), STRUCT(col10 AS value,"col10" AS col_name),
      STRUCT(col11 AS value,"col11" AS col_name), STRUCT(col12 AS value,"col12" AS col_name), STRUCT(col13 AS value,"col13" AS col_name), STRUCT(col14 AS value,"col14" AS col_name), STRUCT(col15 AS value,"col15" AS col_name),
      STRUCT(col16 AS value,"col16" AS col_name), STRUCT(col17 AS value,"col17" AS col_name), STRUCT(col18 AS value,"col18" AS col_name), STRUCT(col19 AS value,"col19" AS col_name), STRUCT(col20 AS value,"col20" AS col_name),
      STRUCT(col21 AS value,"col21" AS col_name), STRUCT(col22 AS value,"col22" AS col_name), STRUCT(col23 AS value,"col23" AS col_name), STRUCT(col24 AS value,"col24" AS col_name), STRUCT(col25 AS value,"col25" AS col_name),
      STRUCT(col26 AS value,"col26" AS col_name), STRUCT(col27 AS value,"col27" AS col_name), STRUCT(col28 AS value,"col28" AS col_name), STRUCT(col29 AS value,"col29" AS col_name), STRUCT(col30 AS value,"col30" AS col_name),
      STRUCT(col31 AS value,"col31" AS col_name), STRUCT(col32 AS value,"col32" AS col_name), STRUCT(col33 AS value,"col33" AS col_name), STRUCT(col34 AS value,"col34" AS col_name), STRUCT(col35 AS value,"col35" AS col_name),
      STRUCT(col36 AS value,"col36" AS col_name), STRUCT(col37 AS value,"col37" AS col_name), STRUCT(col38 AS value,"col38" AS col_name), STRUCT(col39 AS value,"col39" AS col_name), STRUCT(col40 AS value,"col40" AS col_name),
      STRUCT(col41 AS value,"col41" AS col_name), STRUCT(col42 AS value,"col42" AS col_name), STRUCT(col43 AS value,"col43" AS col_name), STRUCT(col44 AS value,"col44" AS col_name), STRUCT(col45 AS value,"col45" AS col_name),
      STRUCT(col46 AS value,"col46" AS col_name), STRUCT(col47 AS value,"col47" AS col_name), STRUCT(col48 AS value,"col48" AS col_name), STRUCT(col49 AS value,"col49" AS col_name), STRUCT(col50 AS value,"col50" AS col_name),
      STRUCT(col51 AS value,"col51" AS col_name), STRUCT(col52 AS value,"col52" AS col_name), STRUCT(col53 AS value,"col53" AS col_name), STRUCT(col54 AS value,"col54" AS col_name), STRUCT(col55 AS value,"col55" AS col_name),
      STRUCT(col56 AS value,"col56" AS col_name), STRUCT(col57 AS value,"col57" AS col_name), STRUCT(col58 AS value,"col58" AS col_name), STRUCT(col59 AS value,"col59" AS col_name), STRUCT(col60 AS value,"col60" AS col_name),
      STRUCT(col61 AS value,"col61" AS col_name), STRUCT(col62 AS value,"col62" AS col_name), STRUCT(col63 AS value,"col63" AS col_name), STRUCT(col64 AS value,"col64" AS col_name), STRUCT(col65 AS value,"col65" AS col_name),
      STRUCT(col66 AS value,"col66" AS col_name), STRUCT(col67 AS value,"col67" AS col_name), STRUCT(col68 AS value,"col68" AS col_name), STRUCT(col69 AS value,"col69" AS col_name), STRUCT(col70 AS value,"col70" AS col_name),
      STRUCT(col71 AS value,"col71" AS col_name), STRUCT(col72 AS value,"col72" AS col_name), STRUCT(col73 AS value,"col73" AS col_name), STRUCT(col74 AS value,"col74" AS col_name), STRUCT(col75 AS value,"col75" AS col_name),
      STRUCT(col76 AS value,"col76" AS col_name), STRUCT(col77 AS value,"col77" AS col_name), STRUCT(col78 AS value,"col78" AS col_name), STRUCT(col79 AS value,"col79" AS col_name), STRUCT(col80 AS value,"col80" AS col_name),
      STRUCT(col81 AS value,"col81" AS col_name), STRUCT(col82 AS value,"col82" AS col_name), STRUCT(col83 AS value,"col83" AS col_name), STRUCT(col84 AS value,"col84" AS col_name), STRUCT(col85 AS value,"col85" AS col_name),
      STRUCT(col86 AS value,"col86" AS col_name), STRUCT(col87 AS value,"col87" AS col_name), STRUCT(col88 AS value,"col88" AS col_name), STRUCT(col89 AS value,"col89" AS col_name), STRUCT(col90 AS value,"col90" AS col_name),
      STRUCT(col91 AS value,"col91" AS col_name), STRUCT(col92 AS value,"col92" AS col_name), STRUCT(col93 AS value,"col93" AS col_name), STRUCT(col94 AS value,"col94" AS col_name), STRUCT(col95 AS value,"col95" AS col_name),
      STRUCT(col96 AS value,"col96" AS col_name), STRUCT(col97 AS value,"col97" AS col_name), STRUCT(col98 AS value,"col98" AS col_name), STRUCT(col99 AS value,"col99" AS col_name), STRUCT(col100 AS value,"col100" AS col_name)
    ] AS col_val_arr
  FROM sample
),
melted AS (
  SELECT id, col_val_arr.col_name, col_val_arr.value
  FROM melt_arr, melt_arr.col_val_arr AS col_val_arr
),
_corr AS (  # compute correlations
  SELECT
    t1.col_name AS col_name1,
    t2.col_name AS col_name2,
    CORR(t1.value, t2.value)
  FROM melted AS t1
  INNER JOIN melted AS t2 USING (id)
  WHERE t1.col_name > t2.col_name
  GROUP BY 1, 2
)
SELECT * FROM _corr
Explanation
Generating the sample data
I generated the rows with sequential numbers using a spreadsheet.
melt_arr
Converts col1 through col100 into a single column of type
ARRAY<STRUCT<value FLOAT64, col_name STRING>>.
Multi-cursor editing is handy for this kind of conversion.
In the BigQuery editor, hold Alt and drag up or down to create multiple cursors.
BigQueryでマルチカーソルを使う - アクセス解析担当者のBigQuery日記
Image
The procedure:
1. Paste col1 through col100.
2. Hold Alt and drag to create multiple cursors.
3. Type up to STRUCT(.
4. Select colxx on each cursor and copy.
5. Type  AS value,".
6. Paste; this inserts the per-line colxx copied in step 4.
7. Type " AS col_name),.
8. Press Escape to leave multi-cursor mode, then delete the trailing comma on the col100 line, which doesn't need one.
This works in most editors, not just BigQuery.
melted
UNNESTs melt_arr.
In this subquery the data is in long format, as in the output image above.
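A minimal Python sketch of the same wide-to-long melt, assuming a hypothetical wide row as a dict:

```python
def melt(row, id_col="id"):
    # Turn every non-id column into its own (id, col_name, value) row,
    # mirroring the melt_arr/melted subqueries above.
    return [(row[id_col], col, val) for col, val in row.items() if col != id_col]

wide = {"id": 1, "col1": 2, "col2": 4}
long_rows = melt(wide)
# long_rows is now one (id, col_name, value) tuple per original column.
```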
_corr
Computes correlations between the columns.
Thoughts
I have a feeling this will come in handy some day.
That's all.
Anomaly detection in BigQuery (reworked to support multiple series)
Overview
So far the anomaly-detection posts have focused on implementing metrics (only two posts so far, though). As mentioned at the end of the previous post, I also want to generate other patterns of sample data and support multiple series.
BigQueryで時系列データから異常部位検出 (EMA 指数移動平均) - アクセス解析担当者のBigQuery日記
I had planned to finish a full set of metrics first and then revisit sample-data generation and multi-series support. But doing the latter first should make each metric easier to compare and apply as I keep adding anomaly-detection metrics, so that is what this post does.
Keywords
- sample data generation
- reworking previously written queries
Approach
- Generate sample data with multiple series.
- Each series gets an id column identifying the series, a _time column for the time axis, and a _data column for the values.
- To give the series different scales, the RAND() values are multiplied by a per-series constant.
- Rework the queries to run against this data.
- Finally, check each series and the recent anomaly scores.
Query
1. Generating sample data with multiple series
WITH sample AS (
  WITH sample_step1 AS (
    SELECT
      id,
      generate_array(1, 100) AS _time_arr  # container for random values and source of _time; the 2nd argument is the series length
    FROM UNNEST(generate_array(1, 10)) AS id  # the 2nd argument is the number of series
  )
  SELECT
    id,
    _time,
    RAND() * id AS _data  # scale each series by a constant; for simplicity, multiply by id
  FROM sample_step1, sample_step1._time_arr AS _time
)
SELECT * FROM sample
2. Reworking the queries for this data
EMA
BigQueryで時系列データから異常部位検出 (EMA 指数移動平均) - アクセス解析担当者のBigQuery日記
CREATE TEMPORARY FUNCTION calc_min_distance(
  part_arr ARRAY<FLOAT64>,
  part_arr_arr ARRAY<STRUCT<inner_arr ARRAY<FLOAT64>>>
) RETURNS FLOAT64 LANGUAGE js AS """
function calc_distance(x, y) {
  if (x.length != y.length) {
    return null
  } else {
    var d = 0
    for (var i = 0; i < x.length; i++) {
      d += Math.pow(x[i] - y[i], 2)
    }
    return Math.sqrt(d)
  }
}
var ret = null
if (part_arr_arr == null) {
  return ret
}
for (var i = 0; i < part_arr_arr.length; i++) {
  var distance = calc_distance(part_arr, part_arr_arr[i].inner_arr)
  if (distance != null) {
    if (ret == null || ret > distance) {
      ret = distance
    }
  }
}
return ret
""";
WITH sample AS (
  WITH sample_step1 AS (
    SELECT
      id,
      generate_array(1, 100) AS _time_arr  # container for random values and source of _time; the 2nd argument is the series length
    FROM UNNEST(generate_array(1, 10)) AS id  # the 2nd argument is the number of series
  )
  SELECT
    id,
    _time,
    RAND() * id AS _data  # scale each series by a constant; for simplicity, multiply by id
  FROM sample_step1, sample_step1._time_arr AS _time
),
#####################
# k-nearest-neighbor anomaly detection
#####################
km1 AS (  # generate the subsequences
  SELECT
    *,
    ARRAY_AGG(_data) OVER (PARTITION BY id ORDER BY _time ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS part_arr
    # change 6 PRECEDING to the subsequence length you want to use
  FROM sample
),
km2 AS (  # generate the set of subsequences to compare against, per window
  SELECT
    *,
    ARRAY_AGG(STRUCT(part_arr)) OVER (PARTITION BY id ORDER BY _time ASC ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS part_arr_arr
    # change 30 PRECEDING to the window size you want to compare against
  FROM km1
),
km_result AS (  # output the k-NN result as km_min_distance
  SELECT
    id,
    _time,
    _data,
    calc_min_distance(part_arr, part_arr_arr) AS km_min_distance
  FROM km2
),
#####################
# EMA (exponential moving average) n=7
#####################
ema1 AS (
  SELECT
    *,
    ARRAY_REVERSE(ARRAY_AGG(_data) OVER (PARTITION BY id ORDER BY _time ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS part_arr
    # collect the current row and the n-1 preceding rows into an ARRAY, reversed so the current row comes first
  FROM km_result  # carries the km_result subquery forward
),
ema_result AS (
  SELECT
    *,
    (  # compute the EMA in a subquery
      SELECT SUM((2/(7+1)) * POW((1 - 2/(7+1)), k) * f_i_k)  # the formula
      FROM UNNEST(part_arr) AS f_i_k WITH OFFSET AS k  # also output the array position, k in the formula
    ) AS ema
  FROM ema1
)
SELECT * EXCEPT (part_arr)
FROM ema_result
WHERE id IN (1, 5, 10)  # limit the data for visualization
Explanation
Nothing in particular to explain about the sample data generation; the visualization section shows the graphs.
On k-NN and EMA
The main change, for both methods, is in the window functions that build the subsequences: I added
PARTITION BY id, and changed the ordering from ORDER BY id asc to ORDER BY _time asc.
In the earlier SQL there was a single series and id represented the time step; in this post's SQL that role is played by the _time column. Sorry for the confusing naming.
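Why PARTITION BY id matters can be sketched in Python with illustrative data: the rolling window must restart at each series boundary.

```python
from collections import defaultdict

# Hypothetical rows: (id, _time, _data) for two series.
rows = [(1, 1, 10.0), (1, 2, 11.0), (1, 3, 12.0),
        (2, 1, 100.0), (2, 2, 110.0)]

# PARTITION BY id: group first, then build each rolling window
# (here ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) inside one series only.
series = defaultdict(list)
for _id, _time, _data in sorted(rows, key=lambda r: (r[0], r[1])):
    series[_id].append(_data)

windows = {}
for _id, data in series.items():
    windows[_id] = [data[max(0, i - 1):i + 1] for i in range(len(data))]
# Series 2's first window is [100.0], not [12.0, 100.0]:
# without the partition, the window would leak across series.
```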
Visualization and thoughts
k-NN
Compared with the single-series case, series with a larger id end up with higher anomaly scores.
EMA
As with k-NN, series with a larger id score higher.
Conclusion
Both methods therefore need some kind of normalized metric; I plan to work on that alongside new metrics. There's a long way to go.
That's all.
Detecting anomalous segments in time-series data with BigQuery (EMA, exponential moving average)
Overview
The second anomaly-detection metric, again chosen because it looked straightforward to implement. It is a moving average that weights recent values more heavily.
I referred to this book. www.kspub.co.jp
Another reference: www.moneypartners.co.jp
Definition
As an aside, I plan to keep adding anomaly-detection metrics, eventually combine them, set some threshold, and go as far as generating anomaly-flagged data and a bot.
Output image
Visualized in Data Portal
Keywords
- EMA
- window functions
- sample data generation
- UNNEST() WITH OFFSET
The sample data generation is the same as in the following post.
Query
WITH sample1 AS (  # generate data with a period of roughly 7
  SELECT
    x AS id,
    3 + 10 * SIN(x * 3.14 / 3.5) + RAND() AS x
  FROM UNNEST(GENERATE_ARRAY(0, 100)) AS x
),
sample2 AS (  # add an anomalous value
  SELECT
    *,
    IF(id = 50, x + 10, x) AS abnomal_x
  FROM sample1
)
#####################
# EMA (exponential moving average) n=7
#####################
SELECT
  *,
  (  # compute the EMA in a subquery
    SELECT SUM((2/(7+1)) * POW((1 - 2/(7+1)), k) * f_i_k)  # the formula
    FROM UNNEST(part_arr) AS f_i_k WITH OFFSET AS k  # also output the array position, k in the formula
  ) AS ema
FROM (
  SELECT
    *,
    ARRAY_REVERSE(ARRAY_AGG(abnomal_x) OVER (ORDER BY id ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS part_arr
    # collect the current row and the n-1 preceding rows into an ARRAY, reversed so the current row comes first
  FROM sample2
) AS t1
Explanation
I'll skip the sample generation; an anomalous value is added around the middle of the id range.
t1
For the EMA calculation, the current value and its recent neighbors are packed into an array using ARRAY_AGG as a window function.
The reason for writing
ARRAY_REVERSE(ARRAY_AGG(abnomal_x))
instead of
ARRAY_AGG(abnomal_x ORDER BY id desc)
is that ARRAY_AGG(x ORDER BY y) apparently can't be used as a window function.
For window functions, see:
標準 SQL での分析関数のコンセプト | BigQuery | Google Cloud
Computing the EMA in a subquery
Nothing special about the formula part.
Regarding
UNNEST(part_arr) AS f_i_k WITH OFFSET AS k
f_i_k and k here correspond to the formula, where k is how many steps back from the current value a data point is.
Writing UNNEST(array) WITH OFFSET AS alias
outputs the zero-based position of each element as a column while expanding the array.
"UNNEST destroys the order of elements in the input ARRAY. Use the optional WITH OFFSET clause to return a second column with the array element indexes."
Standard SQL Query Syntax | BigQuery | Google Cloud
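The EMA subquery can be mirrored in Python (a sketch, with n=7 as in the query): newest value first, each value weighted by alpha*(1-alpha)^k.

```python
def ema_at(values, i, n=7):
    # EMA over the last n points ending at index i, matching the SQL:
    # SUM((2/(n+1)) * POW(1 - 2/(n+1), k) * f_i_k), k = steps back from i.
    alpha = 2 / (n + 1)
    # Newest value first, like ARRAY_REVERSE(ARRAY_AGG(...)) in the query.
    window = list(reversed(values[max(0, i - n + 1):i + 1]))
    return sum(alpha * (1 - alpha) ** k * f for k, f in enumerate(window))

# With a constant input, the EMA approaches the constant as the window fills:
# for n=7 the weights sum to 1 - (1 - 2/8)**7.
data = [1.0] * 10
```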
Visualization and thoughts
From the query results screen, click "Explore in Data Portal" and build a chart.
It does react, but for clearly periodic data like this generated sample, it doesn't move as sharply as k-NN.
As with k-NN, I want to try other patterns of sample data next.
Also, the current query only handles a single series, so I want to fix that and evaluate multiple series with different scales.
Data like this is what I want to handle:
time | category | data |
---|---|---|
1 | A | 12 |
1 | B | 104 |
2 | A | 13 |
2 | B | 150 |
3 | A | 25 |
3 | B | 83 |
That's all.
Creating a SQL UDF in BigQuery that converts hiragana to katakana
Overview
Create a SQL UDF in BigQuery that converts hiragana to katakana.
The flow of the processing is the same as in the following post.
sem-aa-bq.hatenablog.com
Output image
CREATE TEMP FUNCTION hirakata(x STRING) AS ( udf );
SELECT  -- the comment shows the result
  hirakata("あいうeoエオ")  -- アイウeoエオ
Keywords
- UDF
- the TO_CODE_POINTS function
- ARRAY processing
Query
CREATE TEMP FUNCTION hirakata(x STRING) AS (
  CODE_POINTS_TO_STRING(
    ARRAY(
      SELECT IF(x BETWEEN 12353 AND 12438, x + 96, x)
      FROM UNNEST(TO_CODE_POINTS(x)) AS x
    )
  )
);
SELECT
  hirakata("あいうeoエオ")  -- アイウeoエオ
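For reference, the same code-point shift can be sketched in Python: the hiragana block (U+3041..U+3096, i.e. 12353..12438) sits exactly 96 code points below the corresponding katakana.

```python
def hirakata(s):
    # Shift hiragana code points by +96 to reach the katakana block;
    # leave everything else (ASCII, katakana, ...) untouched.
    return "".join(
        chr(ord(ch) + 96) if 12353 <= ord(ch) <= 12438 else ch
        for ch in s
    )

print(hirakata("あいうeoエオ"))  # アイウeoエオ
```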
Creating a UDF in BigQuery that extends a Fibonacci sequence (quiz format)
Overview
A lighter post, since I'm short on material. If you're interested, treat it as a quiz and give it a try.
Problem
Write a UDF that takes an ARRAY, sums its last element and the one before it, and returns the ARRAY with that sum appended. For UDFs, see:
cloud.google.com
A JavaScript UDF could build a Fibonacci sequence of arbitrary length, but that would mostly be a JavaScript problem, so we won't do that.
Example
Given [1,1,2,3,5], compute 3+5 and return [1,1,2,3,5,8].
Keywords
- UDF
- getting comfortable with ARRAY functions
Answer format
CREATE TEMP FUNCTION fib(x ARRAY<INT64>) AS (
  --------------------------------
  -- write your UDF here --
  --------------------------------
);
SELECT  -- if you get the values in the comments back, you're correct
  fib([1,1]),                      --[1,1,2]
  fib(fib([1,1])),                 --[1,1,2,3]
  fib(fib(fib(fib(fib([1,1])))))  --[1,1,2,3,5,8,13]
An example answer is at the bottom of the page.
Example answer
CREATE TEMP FUNCTION fib(x ARRAY<INT64>) AS (
  ARRAY_CONCAT(x, [x[ORDINAL(ARRAY_LENGTH(x))] + x[ORDINAL(ARRAY_LENGTH(x) - 1)]])
);
SELECT
  fib([1,1]),                      --[1,1,2]
  fib(fib([1,1])),                 --[1,1,2,3]
  fib(fib(fib(fib(fib([1,1])))))  --[1,1,2,3,5,8,13]
That's all.
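The example answer's logic, sketched in Python:

```python
def fib_step(arr):
    # Append the sum of the last two elements, like the SQL UDF
    # ARRAY_CONCAT(x, [x[ORDINAL(n)] + x[ORDINAL(n-1)]]).
    return arr + [arr[-1] + arr[-2]]

seq = [1, 1]
for _ in range(5):  # fib(fib(fib(fib(fib([1,1])))))
    seq = fib_step(seq)
print(seq)  # [1, 1, 2, 3, 5, 8, 13]
```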