A Web Analyst's BigQuery Diary

I usually work on web analytics and online ad operations. This blog is mainly a record of things I've learned about BigQuery.

Comparing entire rows at once with TO_JSON_STRING

Overview

  • After changing a query that runs as a batch job, I wanted to compare whole rows between the before and after tables and find the rows that changed.
  • I wanted to use several columns together as a single JOIN key.
  • I wanted an easy way to JOIN even when ARRAY and STRUCT columns are involved.

All of this came up recently. Doing it the straightforward way requires things like:

  • writing out every target column, and
  • working around NULLs (with NULLIF and the like), because a comparison involving NULL returns NULL, which means NULL -> non-NULL changes (and the reverse) go undetected.

In cases like this, TO_JSON_STRING comes in handy.

Keywords for this post

TO_JSON_STRING

Query

The subqueries race1_unnest and race2_unnest are the comparison targets.
Using a LEFT JOIN, we output the rows of race1_unnest that do not exist in race2_unnest.
race2_unnest is generated by rewriting some values of race1_unnest (see the comments in the query).

WITH race1 AS (
     SELECT "800M" AS race,
       [STRUCT("Rudisha" as name,"aiueo" as splits),
        STRUCT("Makhloufi" as name,"aiueo" as splits),
        STRUCT("Murphy" as name,NULL as splits),
        STRUCT("Bosse" as name,"aiueo" as splits),
        STRUCT("Rotich" as name,"aiueo" as splits),
        STRUCT("Lewandowski" as name,"aiueo" as splits),
        STRUCT("Kipketer" as name,"aiueo" as splits),
        STRUCT("Berian" as name,"aiueo" as splits),
        STRUCT("Nathan" as name,"aiueo" as splits),
        STRUCT("David" as name,NULL as splits)]
        AS participants),
      race1_unnest AS (
        SELECT
          race,
          p.*,
        FROM
          race1,race1.participants AS p
      ),
    race2 AS (
     SELECT "800M" AS race,
       [STRUCT("Rudisha" as name, "aiueo" as splits), 
        STRUCT("Makhloufi" as name,NULL as splits), #changed: non-NULL -> NULL
        STRUCT("Murphy" as name,NULL as splits), #unchanged
        STRUCT("Bosse" as name,"aiueo" as splits),
        STRUCT("Rotich" as name,"aiueo" as splits),
        STRUCT("Lewandowski" as name,"aiueo" as splits),
        STRUCT("Kipketer" as name,"aiueo" as splits),
        STRUCT("Berian" as name,"aiueo" as splits),
        STRUCT("Nathan" as name,"kakikukeko" as splits), #changed
        STRUCT("David" as name,"yey" as splits) #changed: NULL -> non-NULL
        ]
        AS participants),
      race2_unnest AS (
        SELECT
          race,
          p.*,
        FROM
          race2,race2.participants AS p
      )

SELECT
  t1,
  t2
FROM
  race1_unnest  AS t1
LEFT JOIN
  race2_unnest AS t2
ON 
  TO_JSON_STRING(t1) = TO_JSON_STRING(t2)
WHERE t2.race IS NULL

Output

f:id:sem-aa-bq:20200504012733p:plain

Explanation

Instead of listing the columns one by one in the JOIN condition, writing TO_JSON_STRING(t1) = TO_JSON_STRING(t2)
compares the whole row at once.
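
As a quick illustration of what is actually being compared (a standalone sketch; the values are made up), TO_JSON_STRING serializes a whole row, NULLs included, into one JSON string, so two rows that are both NULL in a column still produce identical strings:

SELECT
  TO_JSON_STRING(STRUCT("800M" AS race, "Murphy" AS name, CAST(NULL AS STRING) AS splits)) AS row_json
--row_json: {"race":"800M","name":"Murphy","splits":null}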

Comparison with the straightforward version

Output

f:id:sem-aa-bq:20200504014005p:plain

Because NULL = NULL evaluates to NULL (neither TRUE nor FALSE), rows whose compared columns contain a NULL are output as "changed" even when, as in the row above, nothing has actually changed.
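
You can confirm this behavior with a one-liner (sketch):

SELECT CAST(NULL AS STRING) = CAST(NULL AS STRING) AS eq --returns NULL, so the ON condition never becomes TRUE and the LEFT JOIN finds no match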

The query is below.

WITH race1 AS (
     SELECT "800M" AS race,
       [STRUCT("Rudisha" as name,"aiueo" as splits),
        STRUCT("Makhloufi" as name,"aiueo" as splits),
        STRUCT("Murphy" as name,NULL as splits),#unchanged ########### this is the row in question
        STRUCT("Bosse" as name,"aiueo" as splits),
        STRUCT("Rotich" as name,"aiueo" as splits),
        STRUCT("Lewandowski" as name,"aiueo" as splits),
        STRUCT("Kipketer" as name,"aiueo" as splits),
        STRUCT("Berian" as name,"aiueo" as splits),
        STRUCT("Nathan" as name,"aiueo" as splits),
        STRUCT("David" as name,NULL as splits)]
        AS participants),
      race1_unnest AS (
        SELECT
          race,
          p.*,
        FROM
          race1,race1.participants AS p
      ),
    race2 AS (
     SELECT "800M" AS race,
       [STRUCT("Rudisha" as name, "aiueo" as splits), 
        STRUCT("Makhloufi" as name,NULL as splits), #changed: non-NULL -> NULL
        STRUCT("Murphy" as name,NULL as splits), #unchanged ########### this is the row in question
        STRUCT("Bosse" as name,"aiueo" as splits),
        STRUCT("Rotich" as name,"aiueo" as splits),
        STRUCT("Lewandowski" as name,"aiueo" as splits),
        STRUCT("Kipketer" as name,"aiueo" as splits),
        STRUCT("Berian" as name,"aiueo" as splits),
        STRUCT("Nathan" as name,"kakikukeko" as splits), #changed
        STRUCT("David" as name,"yey" as splits) #changed: NULL -> non-NULL
        ]
        AS participants),
      race2_unnest AS (
        SELECT
          race,
          p.*,
        FROM
          race2,race2.participants AS p
      )

SELECT
  t1,
  t2
FROM
  race1_unnest  AS t1
LEFT JOIN
  race2_unnest AS t2
ON 
  t1.race = t2.race
  AND t1.name = t2.name 
  AND t1.splits = t2.splits
WHERE t2.race IS NULL

Reference

It also works with ARRAY and STRUCT columns.

WITH race1 AS (
     SELECT "800M" AS race,
       [STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
        STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
        STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
        STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
        STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
        STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
        STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
        STRUCT("Berian" as name, [23.7, 26.1, 27.0, 29.3] as splits),
        STRUCT("Nathan" as name, ARRAY<FLOAT64>[] as splits),
        STRUCT("David" as name, NULL as splits)]
        AS participants),
      race1_unnest AS (
        SELECT
          race,
          p.*
        FROM
          race1,race1.participants AS p
      ),
    race2 AS (
     SELECT "800M" AS race,
       [STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits),
        STRUCT("Makhloufi" as name, [24.5, 25.4, 26.6, 26.1] as splits),
        STRUCT("Murphy" as name, [23.9, 26.0, 27.0, 26.0] as splits),
        STRUCT("Bosse" as name, [23.6, 26.2, 26.5, 27.1] as splits),
        STRUCT("Rotich" as name, [24.7, 25.6, 26.9, 26.4] as splits),
        STRUCT("Lewandowski" as name, [25.0, 25.7, 26.3, 27.2] as splits),
        STRUCT("Kipketer" as name, [23.2, 26.1, 27.3, 29.4] as splits),
        STRUCT("Berian" as name, [23.7, 26.1, 27.0, 293.0] as splits), #this value has changed
        STRUCT("Nathan" as name, ARRAY<FLOAT64>[] as splits),
        STRUCT("David" as name, NULL as splits)]
        AS participants),
      race2_unnest AS (
        SELECT
          race,
          p.*
        FROM
          race2,race2.participants AS p
      )

SELECT
  t1,
  t2
FROM
  race1_unnest  AS t1
LEFT JOIN
  race2_unnest AS t2
ON TO_JSON_STRING(t1) = TO_JSON_STRING(t2)
WHERE t2.race IS NULL


f:id:sem-aa-bq:20200504010615p:plain

Writing GROUP BY and a window function in the same query

Overview

I suspect many people reach for an intermediate subquery for this kind of thing.
As a sample, this post uses the Google Analytics sample data to output daily and monthly totals of transactions in a single query.

What the output looks like

See the Explanation section.

Keywords for this post

  1. Window functions

Query

SELECT
  _TABLE_SUFFIX AS _date,
  SUBSTR(_TABLE_SUFFIX,1,6) AS _month,
  SUM(totals.transactions) AS daily_transactions,
  SUM(SUM(totals.transactions)) OVER(partition by SUBSTR(_TABLE_SUFFIX,1,6)) AS monthly_transactions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX >= "20170501"
GROUP BY 1,2
ORDER BY 1,2

Explanation

You just write the aggregate function (SUM) inside the window function (SUM): the inner SUM is computed per GROUP BY group, and the outer SUM ... OVER then runs across those group-level results.
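
For comparison, the version that goes through an intermediate subquery first would look roughly like this (a sketch against the same sample dataset):

SELECT
  _date,
  _month,
  daily_transactions,
  SUM(daily_transactions) OVER(PARTITION BY _month) AS monthly_transactions
FROM (
  SELECT
    _TABLE_SUFFIX AS _date,
    SUBSTR(_TABLE_SUFFIX,1,6) AS _month,
    SUM(totals.transactions) AS daily_transactions
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX >= "20170501"
  GROUP BY 1,2
)
ORDER BY 1,2

Writing the aggregate inside the window function saves that extra level of nesting.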

Checking the data

_date _month daily_transactions monthly_transactions
20170501 201705 78 1160
20170502 201705 74 1160
20170503 201705 70 1160
20170504 201705 49 1160
20170505 201705 36 1160
20170506 201705 12 1160
20170507 201705 11 1160
20170508 201705 53 1160
20170509 201705 47 1160
20170510 201705 40 1160
20170511 201705 46 1160
20170512 201705 54 1160
20170513 201705 23 1160
20170514 201705 10 1160
20170515 201705 37 1160
20170516 201705 53 1160
20170517 201705 41 1160
20170518 201705 35 1160
20170519 201705 41 1160
20170520 201705 30 1160
20170521 201705 22 1160
20170522 201705 25 1160
20170523 201705 33 1160
20170524 201705 36 1160
20170525 201705 35 1160
20170526 201705 31 1160
20170527 201705 16 1160
20170528 201705 9 1160
20170529 201705 18 1160
20170530 201705 41 1160
20170531 201705 54 1160
20170601 201706 35 971
20170602 201706 37 971
20170603 201706 15 971
20170604 201706 15 971
20170605 201706 35 971
20170606 201706 29 971
20170607 201706 38 971
20170608 201706 26 971
20170609 201706 36 971
20170610 201706 13 971
20170611 201706 29 971
20170612 201706 61 971
20170613 201706 48 971
20170614 201706 46 971
20170615 201706 36 971
20170616 201706 36 971
20170617 201706 18 971
20170618 201706 28 971
20170619 201706 41 971
20170620 201706 36 971
20170621 201706 37 971
20170622 201706 36 971
20170623 201706 29 971
20170624 201706 10 971
20170625 201706 19 971
20170626 201706 32 971
20170627 201706 40 971
20170628 201706 31 971
20170629 201706 45 971
20170630 201706 34 971
20170701 201707 3 1072
20170702 201707 8 1072
20170703 201707 15 1072
20170704 201707 7 1072
20170705 201707 42 1072
20170706 201707 31 1072
20170707 201707 40 1072
20170708 201707 14 1072
20170709 201707 19 1072
20170710 201707 47 1072
20170711 201707 42 1072
20170712 201707 49 1072
20170713 201707 65 1072
20170714 201707 47 1072
20170715 201707 16 1072
20170716 201707 28 1072
20170717 201707 53 1072
20170718 201707 51 1072
20170719 201707 57 1072
20170720 201707 41 1072
20170721 201707 42 1072
20170722 201707 18 1072
20170723 201707 16 1072
20170724 201707 40 1072
20170725 201707 38 1072
20170726 201707 42 1072
20170727 201707 52 1072
20170728 201707 46 1072
20170729 201707 19 1072
20170730 201707 22 1072
20170731 201707 62 1072
20170801 201708 45 45

Monthly aggregation

SELECT
  SUBSTR(_TABLE_SUFFIX,1,6) AS _month,
  SUM(totals.transactions) AS monthly_transactions
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX >= "20170501"
GROUP BY 1
ORDER BY 1
_month monthly_transactions
201705 1160
201706 971
201707 1072
201708 45

That's all for this post.

Converting wide data to long format in BigQuery (using multiple cursors)

Aside

A small tip this time. Next I'd like to look into BigQuery ML and write about that.
I'll come back to anomaly detection once I've studied a bit more.

Overview

We convert wide-format data (one column per variable) into long format.
There are probably many benefits to having long-format data; for example, it might make it easier to compute correlations between all columns when choosing features for BigQuery ML.

Keywords

- Brute force
- Using multiple cursors

What the output looks like

Before conversion

id col1 col2 .... col100
1 2 4 .... 15
2 3 1 .... 2
3 5 0 .... 3
4 1 1 .... 1

After conversion

id col_name value
1 col1 2
1 col2 4
1 col3 0
... ... ...
1 col100 15
2 col1 3
2 col2 1
... ... ...
2 col100 2

(Below this, col1 through col100 for id 3 and 4 continue.)

Once the data is in long format like this, you can, for example, self-join on id and compute the correlation between each pair of columns.

Query

with sample AS (
#generate 100,000 rows * 100 columns of data
SELECT
  SUM(1) OVER(ORDER BY RAND()) AS id,
  RAND() AS col1,
  RAND() AS col2,
  RAND() AS col3,
  RAND() AS col4,
  RAND() AS col5,
  RAND() AS col6,
  RAND() AS col7,
  RAND() AS col8,
  RAND() AS col9,
  RAND() AS col10,
  RAND() AS col11,
  RAND() AS col12,
  RAND() AS col13,
  RAND() AS col14,
  RAND() AS col15,
  RAND() AS col16,
  RAND() AS col17,
  RAND() AS col18,
  RAND() AS col19,
  RAND() AS col20,
  RAND() AS col21,
  RAND() AS col22,
  RAND() AS col23,
  RAND() AS col24,
  RAND() AS col25,
  RAND() AS col26,
  RAND() AS col27,
  RAND() AS col28,
  RAND() AS col29,
  RAND() AS col30,
  RAND() AS col31,
  RAND() AS col32,
  RAND() AS col33,
  RAND() AS col34,
  RAND() AS col35,
  RAND() AS col36,
  RAND() AS col37,
  RAND() AS col38,
  RAND() AS col39,
  RAND() AS col40,
  RAND() AS col41,
  RAND() AS col42,
  RAND() AS col43,
  RAND() AS col44,
  RAND() AS col45,
  RAND() AS col46,
  RAND() AS col47,
  RAND() AS col48,
  RAND() AS col49,
  RAND() AS col50,
  RAND() AS col51,
  RAND() AS col52,
  RAND() AS col53,
  RAND() AS col54,
  RAND() AS col55,
  RAND() AS col56,
  RAND() AS col57,
  RAND() AS col58,
  RAND() AS col59,
  RAND() AS col60,
  RAND() AS col61,
  RAND() AS col62,
  RAND() AS col63,
  RAND() AS col64,
  RAND() AS col65,
  RAND() AS col66,
  RAND() AS col67,
  RAND() AS col68,
  RAND() AS col69,
  RAND() AS col70,
  RAND() AS col71,
  RAND() AS col72,
  RAND() AS col73,
  RAND() AS col74,
  RAND() AS col75,
  RAND() AS col76,
  RAND() AS col77,
  RAND() AS col78,
  RAND() AS col79,
  RAND() AS col80,
  RAND() AS col81,
  RAND() AS col82,
  RAND() AS col83,
  RAND() AS col84,
  RAND() AS col85,
  RAND() AS col86,
  RAND() AS col87,
  RAND() AS col88,
  RAND() AS col89,
  RAND() AS col90,
  RAND() AS col91,
  RAND() AS col92,
  RAND() AS col93,
  RAND() AS col94,
  RAND() AS col95,
  RAND() AS col96,
  RAND() AS col97,
  RAND() AS col98,
  RAND() AS col99,
  RAND() AS col100
FROM
  UNNEST(GENERATE_ARRAY(0,100000))
),
melt_arr AS 
(
SELECT
  id,
[
STRUCT(col1 AS value,"col1" AS col_name),
STRUCT(col2 AS value,"col2" AS col_name),
STRUCT(col3 AS value,"col3" AS col_name),
STRUCT(col4 AS value,"col4" AS col_name),
STRUCT(col5 AS value,"col5" AS col_name),
STRUCT(col6 AS value,"col6" AS col_name),
STRUCT(col7 AS value,"col7" AS col_name),
STRUCT(col8 AS value,"col8" AS col_name),
STRUCT(col9 AS value,"col9" AS col_name),
STRUCT(col10 AS value,"col10" AS col_name),
STRUCT(col11 AS value,"col11" AS col_name),
STRUCT(col12 AS value,"col12" AS col_name),
STRUCT(col13 AS value,"col13" AS col_name),
STRUCT(col14 AS value,"col14" AS col_name),
STRUCT(col15 AS value,"col15" AS col_name),
STRUCT(col16 AS value,"col16" AS col_name),
STRUCT(col17 AS value,"col17" AS col_name),
STRUCT(col18 AS value,"col18" AS col_name),
STRUCT(col19 AS value,"col19" AS col_name),
STRUCT(col20 AS value,"col20" AS col_name),
STRUCT(col21 AS value,"col21" AS col_name),
STRUCT(col22 AS value,"col22" AS col_name),
STRUCT(col23 AS value,"col23" AS col_name),
STRUCT(col24 AS value,"col24" AS col_name),
STRUCT(col25 AS value,"col25" AS col_name),
STRUCT(col26 AS value,"col26" AS col_name),
STRUCT(col27 AS value,"col27" AS col_name),
STRUCT(col28 AS value,"col28" AS col_name),
STRUCT(col29 AS value,"col29" AS col_name),
STRUCT(col30 AS value,"col30" AS col_name),
STRUCT(col31 AS value,"col31" AS col_name),
STRUCT(col32 AS value,"col32" AS col_name),
STRUCT(col33 AS value,"col33" AS col_name),
STRUCT(col34 AS value,"col34" AS col_name),
STRUCT(col35 AS value,"col35" AS col_name),
STRUCT(col36 AS value,"col36" AS col_name),
STRUCT(col37 AS value,"col37" AS col_name),
STRUCT(col38 AS value,"col38" AS col_name),
STRUCT(col39 AS value,"col39" AS col_name),
STRUCT(col40 AS value,"col40" AS col_name),
STRUCT(col41 AS value,"col41" AS col_name),
STRUCT(col42 AS value,"col42" AS col_name),
STRUCT(col43 AS value,"col43" AS col_name),
STRUCT(col44 AS value,"col44" AS col_name),
STRUCT(col45 AS value,"col45" AS col_name),
STRUCT(col46 AS value,"col46" AS col_name),
STRUCT(col47 AS value,"col47" AS col_name),
STRUCT(col48 AS value,"col48" AS col_name),
STRUCT(col49 AS value,"col49" AS col_name),
STRUCT(col50 AS value,"col50" AS col_name),
STRUCT(col51 AS value,"col51" AS col_name),
STRUCT(col52 AS value,"col52" AS col_name),
STRUCT(col53 AS value,"col53" AS col_name),
STRUCT(col54 AS value,"col54" AS col_name),
STRUCT(col55 AS value,"col55" AS col_name),
STRUCT(col56 AS value,"col56" AS col_name),
STRUCT(col57 AS value,"col57" AS col_name),
STRUCT(col58 AS value,"col58" AS col_name),
STRUCT(col59 AS value,"col59" AS col_name),
STRUCT(col60 AS value,"col60" AS col_name),
STRUCT(col61 AS value,"col61" AS col_name),
STRUCT(col62 AS value,"col62" AS col_name),
STRUCT(col63 AS value,"col63" AS col_name),
STRUCT(col64 AS value,"col64" AS col_name),
STRUCT(col65 AS value,"col65" AS col_name),
STRUCT(col66 AS value,"col66" AS col_name),
STRUCT(col67 AS value,"col67" AS col_name),
STRUCT(col68 AS value,"col68" AS col_name),
STRUCT(col69 AS value,"col69" AS col_name),
STRUCT(col70 AS value,"col70" AS col_name),
STRUCT(col71 AS value,"col71" AS col_name),
STRUCT(col72 AS value,"col72" AS col_name),
STRUCT(col73 AS value,"col73" AS col_name),
STRUCT(col74 AS value,"col74" AS col_name),
STRUCT(col75 AS value,"col75" AS col_name),
STRUCT(col76 AS value,"col76" AS col_name),
STRUCT(col77 AS value,"col77" AS col_name),
STRUCT(col78 AS value,"col78" AS col_name),
STRUCT(col79 AS value,"col79" AS col_name),
STRUCT(col80 AS value,"col80" AS col_name),
STRUCT(col81 AS value,"col81" AS col_name),
STRUCT(col82 AS value,"col82" AS col_name),
STRUCT(col83 AS value,"col83" AS col_name),
STRUCT(col84 AS value,"col84" AS col_name),
STRUCT(col85 AS value,"col85" AS col_name),
STRUCT(col86 AS value,"col86" AS col_name),
STRUCT(col87 AS value,"col87" AS col_name),
STRUCT(col88 AS value,"col88" AS col_name),
STRUCT(col89 AS value,"col89" AS col_name),
STRUCT(col90 AS value,"col90" AS col_name),
STRUCT(col91 AS value,"col91" AS col_name),
STRUCT(col92 AS value,"col92" AS col_name),
STRUCT(col93 AS value,"col93" AS col_name),
STRUCT(col94 AS value,"col94" AS col_name),
STRUCT(col95 AS value,"col95" AS col_name),
STRUCT(col96 AS value,"col96" AS col_name),
STRUCT(col97 AS value,"col97" AS col_name),
STRUCT(col98 AS value,"col98" AS col_name),
STRUCT(col99 AS value,"col99" AS col_name),
STRUCT(col100 AS value,"col100" AS col_name)
] AS col_val_arr
FROM
  sample
),
melted AS (
SELECT
  id,
  col_val_arr.col_name,
  col_val_arr.value
FROM
  melt_arr,melt_arr.col_val_arr AS col_val_arr
),
_corr AS (
#compute the correlations
SELECT
  t1.col_name AS col_name1,
  t2.col_name AS col_name2,
  CORR(t1.value,t2.value)
FROM
  melted AS t1
INNER JOIN
  melted AS t2
USING
  (id)
WHERE
  t1.col_name > t2.col_name
GROUP BY
  1,2
)
SELECT
  *
FROM
  _corr

Explanation

Generating the sample data

I generated the numbered lines with a spreadsheet.

melt_arr

This converts col1 through col100 into a single column of type
ARRAY<STRUCT<value FLOAT64,col_name STRING>>
Multiple cursors are very handy for writing out this kind of transformation.

In the BigQuery editor, you can place multiple cursors by holding Alt and dragging up or down.
Using multiple cursors in BigQuery - A Web Analyst's BigQuery Diary

Illustration

f:id:sem-aa-bq:20190422234017g:plain
Multiple cursors

The steps are:
1. First, paste col1 through col100.
2. Hold Alt and drag to place multiple cursors.
3. Type everything up to STRUCT.
4. On each cursor, select the colxx and copy it.
5. Type  AS value," .
6. Paste. The per-line colxx copied in step 4 is pasted back onto each line.
7. Type " AS col_name), .
8. Press Escape to leave multi-cursor mode, then delete the trailing comma on the col100 line, which doesn't need it.

This works in most editors, not just BigQuery.

melted

This UNNESTs melt_arr.
In this subquery the data is in long format, as shown in "What the output looks like" above.
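
As an aside, if the UNPIVOT operator is available in your BigQuery environment (it was added to standard SQL after this post was written, so treat this as an untested sketch), the same wide-to-long reshaping can be expressed without spelling out every STRUCT:

SELECT
  id,
  col_name,
  value
FROM
  sample UNPIVOT(value FOR col_name IN (col1, col2, col3)) --list col1 ... col100 here

You still have to list the column names, so multiple cursors remain useful for that part.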

_corr

This computes the correlation for each pair of columns. The condition t1.col_name > t2.col_name keeps each pair only once and drops self-correlations.

Thoughts

I have a feeling this will come in handy someday.
That's all for this post.

Anomaly detection in BigQuery (reworked to support multiple series)

Overview

Up to the last post, the anomaly detection series focused on implementing metrics (only two posts so far). On the other hand, as I wrote at the end of the previous post, I also want to generate other patterns of sample data and support multiple series.
Detecting anomalous segments in time series data with BigQuery (EMA, exponential moving average) - A Web Analyst's BigQuery Diary

My original plan was to build out a full set of metrics first and only then move on to other sample data patterns and multiple series. But I decided that doing this part first, and then adding more anomaly detection metrics on top of it, would make it easier to compare and make use of each metric, so that is what this post does.

Keywords for this post

  1. Generating sample data
  2. Reworking previously written queries

Approach

  1. Generate sample data with multiple series.
    1. Each series gets an id column identifying the series, a _time column for the time axis, and a _data column for the values.
    2. So that each series has a different scale, the values generated with rand() are multiplied by a per-series constant.
  2. Rework the queries so they can be applied to this data.
  3. Finally, look at each series and how its anomaly score has moved recently.

Query

1. Generating sample data with multiple series

WITH
  sample AS (
    WITH
    sample_step1 AS (
      SELECT
        id,
        generate_array(1,100) AS _time_arr #container for the random values and the numbers for the _time column; the 2nd argument is the length of each series
      FROM
        UNNEST(generate_array(1,10)) AS id #the 2nd argument is the number of series
      )
      SELECT
        id,
        _time,
        RAND() * id AS _data #scale each series by a constant; for simplicity, multiply by id
      FROM
        sample_step1,sample_step1._time_arr AS _time
    )
    SELECT
      *
    FROM
      sample

2. Reworking the queries to apply to this data

EMA

Detecting anomalous segments in time series data with BigQuery (EMA, exponential moving average) - A Web Analyst's BigQuery Diary

CREATE TEMPORARY FUNCTION calc_min_distance(part_arr ARRAY<FLOAT64>,part_arr_arr ARRAY<STRUCT< inner_arr ARRAY<FLOAT64>>>) RETURNS FLOAT64
LANGUAGE js AS """
  function calc_distance(x,y) {
    if(x.length != y.length) {
      return null
    } else {
      var d = 0
      for(var i = 0;i<x.length;i++){
        d += Math.pow(x[i] - y[i],2)
      }
      return Math.sqrt(d)
    }
  }
  
  var ret = null
  if(part_arr_arr == null){
    return ret
  }
  for(var i = 0 ; i < part_arr_arr.length;i++){
    var distance = calc_distance(part_arr,part_arr_arr[i].inner_arr)
    if(distance!=null){
      if(ret==null || ret > distance){
        ret = distance
      }
    }
  }
  
  return ret
""";
WITH
  sample AS (
    WITH
    sample_step1 AS (
      SELECT
        id,
        generate_array(1,100) AS _time_arr #container for the random values and the numbers for the _time column; the 2nd argument is the length of each series
      FROM
        UNNEST(generate_array(1,10)) AS id #the 2nd argument is the number of series
      )
      SELECT
        id,
        _time,
        RAND() * id AS _data #scale each series by a constant; for simplicity, multiply by id
      FROM
        sample_step1,sample_step1._time_arr AS _time
    ),
  #####################
  #k-nearest-neighbor anomaly detection
  #####################
  km1 AS (
     #generate the subsequences
  SELECT
     *,
     ARRAY_AGG(_data) OVER(PARTITION BY id ORDER BY _time asc ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS part_arr # change 6 PRECEDING to the subsequence length you want to use
  FROM sample
  ),
  km2 AS (
    #within the window being compared, collect the subsequences into an array
    SELECT
      *,
      ARRAY_AGG(STRUCT(part_arr)) OVER(PARTITION BY id ORDER BY _time asc ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING) AS part_arr_arr # change 30 PRECEDING to the window size you want to compare against
    FROM
      km1
  ),
  km_result AS (
  #output the k-NN result as km_min_distance
    SELECT
       id,
       _time,
       _data,
      calc_min_distance(part_arr,part_arr_arr) AS km_min_distance
    FROM
      km2
  ),
  #####################
  #EMA (exponential moving average), n=7
  #####################
  ema1 AS (
      SELECT
        *,
        ARRAY_REVERSE(ARRAY_AGG(_data) OVER(partition by id ORDER BY _time asc ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS part_arr #put the current row and the previous n-1 rows into an ARRAY, reversed so the current row comes first
      FROM
        km_result #carry over the km_result subquery
  ),
  ema_result AS (
    SELECT
      *,
      (#compute the EMA in a subquery
        SELECT
          SUM((2/(7+1))*POW((1-2/(7+1)),k) * f_i_k) #the formula part
         FROM
          UNNEST(part_arr) AS f_i_k WITH OFFSET AS k #also output each element's position in the ARRAY; this is k in the formula
       ) AS ema
    FROM 
      ema1
  )

SELECT
  * except(part_arr)
FROM
  ema_result
WHERE id IN(1,5,10) #restrict the data for visualization

Explanation

I won't go into the sample data generation here; the graphs appear in the visualization part.

About k-NN and EMA

The main change is that, in the window functions that build the subsequences in both queries, I added

partition by id

and changed the ordering from

order by id asc

to

order by _time asc

In the SQL I wrote before there was only one series, and id was used to mark the position in time; in this post's SQL that role is played by the _time column instead. Sorry for the confusion.
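
Concretely, the subsequence window changes roughly like this (a sketch; the previous post used abnomal_x as the data column, this one uses _data):

#previous post (single series)
ARRAY_AGG(abnomal_x) OVER(ORDER BY id asc ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

#this post (multiple series)
ARRAY_AGG(_data) OVER(PARTITION BY id ORDER BY _time asc ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)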

Visualization and thoughts

Data for each series

f:id:sem-aa-bq:20190318231508p:plain
Only the data for id 1, 5 and 10 is shown.
Because the sample data generation SQL uses

RAND() * id

each series ends up with a different distribution.

k-nearest neighbors

f:id:sem-aa-bq:20190318231458p:plain
Compared with the single-series case, you can see that series with a larger id end up with a higher anomaly score.

EMA

f:id:sem-aa-bq:20190318231453p:plain
As with k-NN, series with a larger id come out higher.

Conclusion

So both metrics need some additional, normalized version; I'd like to work on that together with building other metrics. There's a long way to go.


That's all for this post.

Detecting anomalous segments in time series data with BigQuery (EMA, exponential moving average)

Overview

This is the second post in the anomaly detection series. As before, I picked EMA because it looked easy to implement. It is a moving average that puts more weight on recent values.

I used this book as a reference: www.kspub.co.jp

Another article I referred to: www.moneypartners.co.jp

Definition

EMA^{(n)}_i = \sum_{k=0}^{n-1} \alpha(1-\alpha)^k f_{i-k}\\
\alpha = \frac{2}{n+1}

where f_i is the signal value at time i and n is the window length that controls the weighting.

As a side note, I plan to keep adding anomaly detection metrics, eventually combine them, set some kind of threshold, and go as far as outputting the data judged anomalous and generating a bot.

What the output looks like

Visualized in Data Portal

f:id:sem-aa-bq:20190308015238p:plain
EMA anomaly score visualized in Data Portal

Keywords for this post

  1. EMA
  2. Window functions
  3. Generating sample data
  4. UNNEST() WITH OFFSET

The sample data generation is the same as in this post:

sem-aa-bq.hatenablog.com

Query

WITH
  sample1 AS (
    #generate data with a period of roughly 7
  SELECT
    x AS id,
    3 + 10 * SIN( x * 3.14 /3.5) + RAND() AS x
  FROM
    UNNEST(GENERATE_ARRAY(0,100)) AS x ),
    sample2 AS (
    #create data with an anomalous value added
  SELECT
    *,
    IF(id = 50,x + 10,x) AS abnomal_x
  FROM
    sample1
  )
  #####################
  #EMA (exponential moving average), n=7
  #####################
  SELECT
    *,
    (#compute the EMA in a subquery
      SELECT
        SUM((2/(7+1))*POW((1-2/(7+1)),k) * f_i_k) #the formula part
       FROM
        UNNEST(part_arr) AS f_i_k WITH OFFSET AS k #also output each element's position in the ARRAY; this is k in the formula
     ) AS ema
  FROM 
    (
      SELECT
        *,
        ARRAY_REVERSE(ARRAY_AGG(abnomal_x) OVER(ORDER BY id asc ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)) AS part_arr #put the current row and the previous n-1 rows into an ARRAY, reversed so the current row comes first
      FROM
        sample2
    ) AS t1

Explanation

I'll skip the sample generation. An anomalous value is added roughly in the middle when the rows are ordered by id.

t1

For the EMA calculation, the current value and its most recent preceding values are packed into an array, using ARRAY_AGG as a window function.

The reason for writing
ARRAY_REVERSE(ARRAY_AGG(abnomal_x))
rather than
ARRAY_AGG(abnomal_x ORDER BY id desc)
is that ARRAY_AGG(x ORDER BY y) apparently cannot be used as a window function. For more on window functions, see here:

標準 SQL での分析関数のコンセプト  |  BigQuery  |  Google Cloud

Computing the EMA in a subquery

Nothing special about the formula part itself.
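
Just to make the weights concrete, with n = 7 the query's formula works out to:

\alpha = \frac{2}{7+1} = 0.25\\
EMA^{(7)}_i = 0.25 f_i + 0.25 \cdot 0.75 \, f_{i-1} + 0.25 \cdot 0.75^2 f_{i-2} + \dots + 0.25 \cdot 0.75^6 f_{i-6}\\
\approx 0.25 f_i + 0.188 f_{i-1} + 0.141 f_{i-2} + \dots + 0.044 f_{i-6}

so the most recent value carries the largest weight and the weights decay geometrically.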

Regarding

UNNEST(part_arr) AS f_i_k WITH OFFSET AS k

f_i_k and k here correspond to f_{i-k} and k in the formula; k tells you how many rows before the current row each value sits. Writing

UNNEST(array) WITH OFFSET AS alias

expands the array and, at the same time, outputs each element's 0-based position as a column.

UNNEST destroys the order of the elements in the input ARRAY. To return a second column with the index of each array element, use the optional WITH OFFSET clause.

Standard SQL Query Syntax  |  BigQuery  |  Google Cloud
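
A minimal, self-contained example of the syntax (made-up values):

SELECT elem, k
FROM UNNEST(["a","b","c"]) AS elem WITH OFFSET AS k
ORDER BY k
#elem k
#a    0
#b    1
#c    2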

Visualization and thoughts

From the query results screen, click "Explore with Data Portal" and build a chart.

f:id:sem-aa-bq:20190308015238p:plain
EMA anomaly score visualized in Data Portal

It does react to the anomaly, but for clearly periodic data like the sample generated here it doesn't move as strongly as k-NN does.
As with k-NN, I'd like to generate other patterns of sample data and test against them.
Also, the query as written can only handle a single series, so I want to fix that and make it possible to evaluate multiple series with different scales.

I want to be able to handle data like this:

time category data
1 A 12
1 B 104
2 A 13
2 B 150
3 A 25
3 B 83

That's all for this post.

Creating a SQL UDF in BigQuery that converts hiragana to katakana

Overview

We create a SQL UDF that converts hiragana to katakana in BigQuery.
The flow of the processing is the same as in the following post:
sem-aa-bq.hatenablog.com

What the output looks like

CREATE TEMP FUNCTION hirakata(x STRING) AS (
udf
);
SELECT
  --the comments show the results
  hirakata("あいうeoエオ") --アイウeoエオ

Keywords for this post

  1. UDF
  2. The TO_CODE_POINTS function
  3. Working with ARRAYs

Query

  CREATE TEMP FUNCTION hirakata(x STRING) AS ( CODE_POINTS_TO_STRING( ARRAY(
      SELECT
        IF(x BETWEEN 12353
          AND 12438, x + 96, x)
      FROM
        UNNEST( TO_CODE_POINTS(x)) AS x ) ));
SELECT
  hirakata("あいうeoエオ"), --アイウeoエオ
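
For reference, the constants in the UDF come straight from Unicode: hiragana ぁ through ゖ occupy code points 12353 through 12438, and each katakana counterpart sits exactly 96 code points higher. A quick check (sketch):

SELECT
  TO_CODE_POINTS("あ")[OFFSET(0)] AS hira, --12354
  TO_CODE_POINTS("ア")[OFFSET(0)] AS kata  --12450 = 12354 + 96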

Creating a UDF in BigQuery that extends a Fibonacci sequence (quiz format)

Overview

I'm short on material, so this is a lighter post. If you're interested, treat it as a quiz and try it yourself.

Problem

Write a UDF that, given an ARRAY, sums its last element and its second-to-last element and returns the ARRAY with that sum appended. See here for details on UDFs:
cloud.google.com
Writing the UDF in JavaScript would allow a Fibonacci sequence of arbitrary length, but then it becomes mostly a JavaScript problem, so we won't do that here.

Given [1,1,2,3,5] as the argument, compute 3 + 5 and return [1,1,2,3,5,8].

Keywords for this post

  1. UDF
  2. Getting comfortable with ARRAY functions

Answer format

CREATE TEMP FUNCTION fib(x ARRAY<INT64>) AS (
--------------------------------
--write your UDF here--
--------------------------------
);
SELECT
--you got it right if the values in the comments come back
fib([1,1]), --[1,1,2]
fib(fib([1,1])), --[1,1,2,3]
fib(fib(fib(fib(fib([1,1]))))) --[1,1,2,3,5,8,13]

Reference
cloud.google.com



An example answer is at the bottom of the page.

















Example answer

CREATE TEMP FUNCTION fib(x ARRAY<INT64>) AS (
ARRAY_CONCAT(x,[(x[ORDINAL(ARRAY_LENGTH(x))]) +x[ORDINAL(ARRAY_LENGTH(x) - 1)]])
);
SELECT
fib([1,1]), --[1,1,2]
fib(fib([1,1])), --[1,1,2,3]
fib(fib(fib(fib(fib([1,1]))))) --[1,1,2,3,5,8,13]
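
ORDINAL is 1-based; the same idea written with 0-based OFFSET would look like this (an equivalent sketch, with the function name fib2 chosen just for illustration):

CREATE TEMP FUNCTION fib2(x ARRAY<INT64>) AS (
  --last element (OFFSET length-1) plus second-to-last element (OFFSET length-2), appended to the input
  ARRAY_CONCAT(x, [x[OFFSET(ARRAY_LENGTH(x) - 1)] + x[OFFSET(ARRAY_LENGTH(x) - 2)]])
);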

That's all for this post.