アクセス解析担当者のBigQuery日記

普段はアクセス解析やウェブ広告の運用をしています。主にBigQueryについて、記録していきます。

BigQueryで横持ちのデータを縦持ちに変換(マルチカーソルを使う)

閑話

小ネタです。次はBigQueryMLあたりを調べてを書いていきたいと思っています。
異常検知はもう少し勉強してから再開していきます。

概要

横持ちのデータを縦持ちに変換します。
縦持ちにできて嬉しいことは色々あるかと思いますが、例えばBigQueryMLでのパラメータ選択をする際に、全てのカラムの相関を計算しやすくなるなどひょっとしたらあるかもしれません。

キーワード

- 力技
- マルチカーソルを使う

アウトプットのイメージ

変換前

id col1 col2 .... col100
1 2 4 .... 15
2 3 1 .... 2
3 5 0 .... 3
4 1 1 .... 1

変換後

id col_name value
1 col1 2
1 col2 4
1 col3 0
... ... ...
1 col100 15
2 col1 3
2 col2 1
... ... ...
2 col100 2

以下id 3,4のcol1~col100が続く

この様にデータが縦持ちになると、idをキーにjoinして、各カラムの相関の計算が出来たりします。

クエリ

with sample AS (
#10万行*100列のデータを生成
SELECT
  SUM(1) OVER(ORDER BY RAND()) AS id,
  RAND() AS col1,
  RAND() AS col2,
  RAND() AS col3,
  RAND() AS col4,
  RAND() AS col5,
  RAND() AS col6,
  RAND() AS col7,
  RAND() AS col8,
  RAND() AS col9,
  RAND() AS col10,
  RAND() AS col11,
  RAND() AS col12,
  RAND() AS col13,
  RAND() AS col14,
  RAND() AS col15,
  RAND() AS col16,
  RAND() AS col17,
  RAND() AS col18,
  RAND() AS col19,
  RAND() AS col20,
  RAND() AS col21,
  RAND() AS col22,
  RAND() AS col23,
  RAND() AS col24,
  RAND() AS col25,
  RAND() AS col26,
  RAND() AS col27,
  RAND() AS col28,
  RAND() AS col29,
  RAND() AS col30,
  RAND() AS col31,
  RAND() AS col32,
  RAND() AS col33,
  RAND() AS col34,
  RAND() AS col35,
  RAND() AS col36,
  RAND() AS col37,
  RAND() AS col38,
  RAND() AS col39,
  RAND() AS col40,
  RAND() AS col41,
  RAND() AS col42,
  RAND() AS col43,
  RAND() AS col44,
  RAND() AS col45,
  RAND() AS col46,
  RAND() AS col47,
  RAND() AS col48,
  RAND() AS col49,
  RAND() AS col50,
  RAND() AS col51,
  RAND() AS col52,
  RAND() AS col53,
  RAND() AS col54,
  RAND() AS col55,
  RAND() AS col56,
  RAND() AS col57,
  RAND() AS col58,
  RAND() AS col59,
  RAND() AS col60,
  RAND() AS col61,
  RAND() AS col62,
  RAND() AS col63,
  RAND() AS col64,
  RAND() AS col65,
  RAND() AS col66,
  RAND() AS col67,
  RAND() AS col68,
  RAND() AS col69,
  RAND() AS col70,
  RAND() AS col71,
  RAND() AS col72,
  RAND() AS col73,
  RAND() AS col74,
  RAND() AS col75,
  RAND() AS col76,
  RAND() AS col77,
  RAND() AS col78,
  RAND() AS col79,
  RAND() AS col80,
  RAND() AS col81,
  RAND() AS col82,
  RAND() AS col83,
  RAND() AS col84,
  RAND() AS col85,
  RAND() AS col86,
  RAND() AS col87,
  RAND() AS col88,
  RAND() AS col89,
  RAND() AS col90,
  RAND() AS col91,
  RAND() AS col92,
  RAND() AS col93,
  RAND() AS col94,
  RAND() AS col95,
  RAND() AS col96,
  RAND() AS col97,
  RAND() AS col98,
  RAND() AS col99,
  RAND() AS col100
FROM
  UNNEST(GENERATE_ARRAY(0,100000))
),
melt_arr AS 
(
SELECT
  id,
[
STRUCT(col1 AS value,"col1" AS col_name),
STRUCT(col2 AS value,"col2" AS col_name),
STRUCT(col3 AS value,"col3" AS col_name),
STRUCT(col4 AS value,"col4" AS col_name),
STRUCT(col5 AS value,"col5" AS col_name),
STRUCT(col6 AS value,"col6" AS col_name),
STRUCT(col7 AS value,"col7" AS col_name),
STRUCT(col8 AS value,"col8" AS col_name),
STRUCT(col9 AS value,"col9" AS col_name),
STRUCT(col10 AS value,"col10" AS col_name),
STRUCT(col11 AS value,"col11" AS col_name),
STRUCT(col12 AS value,"col12" AS col_name),
STRUCT(col13 AS value,"col13" AS col_name),
STRUCT(col14 AS value,"col14" AS col_name),
STRUCT(col15 AS value,"col15" AS col_name),
STRUCT(col16 AS value,"col16" AS col_name),
STRUCT(col17 AS value,"col17" AS col_name),
STRUCT(col18 AS value,"col18" AS col_name),
STRUCT(col19 AS value,"col19" AS col_name),
STRUCT(col20 AS value,"col20" AS col_name),
STRUCT(col21 AS value,"col21" AS col_name),
STRUCT(col22 AS value,"col22" AS col_name),
STRUCT(col23 AS value,"col23" AS col_name),
STRUCT(col24 AS value,"col24" AS col_name),
STRUCT(col25 AS value,"col25" AS col_name),
STRUCT(col26 AS value,"col26" AS col_name),
STRUCT(col27 AS value,"col27" AS col_name),
STRUCT(col28 AS value,"col28" AS col_name),
STRUCT(col29 AS value,"col29" AS col_name),
STRUCT(col30 AS value,"col30" AS col_name),
STRUCT(col31 AS value,"col31" AS col_name),
STRUCT(col32 AS value,"col32" AS col_name),
STRUCT(col33 AS value,"col33" AS col_name),
STRUCT(col34 AS value,"col34" AS col_name),
STRUCT(col35 AS value,"col35" AS col_name),
STRUCT(col36 AS value,"col36" AS col_name),
STRUCT(col37 AS value,"col37" AS col_name),
STRUCT(col38 AS value,"col38" AS col_name),
STRUCT(col39 AS value,"col39" AS col_name),
STRUCT(col40 AS value,"col40" AS col_name),
STRUCT(col41 AS value,"col41" AS col_name),
STRUCT(col42 AS value,"col42" AS col_name),
STRUCT(col43 AS value,"col43" AS col_name),
STRUCT(col44 AS value,"col44" AS col_name),
STRUCT(col45 AS value,"col45" AS col_name),
STRUCT(col46 AS value,"col46" AS col_name),
STRUCT(col47 AS value,"col47" AS col_name),
STRUCT(col48 AS value,"col48" AS col_name),
STRUCT(col49 AS value,"col49" AS col_name),
STRUCT(col50 AS value,"col50" AS col_name),
STRUCT(col51 AS value,"col51" AS col_name),
STRUCT(col52 AS value,"col52" AS col_name),
STRUCT(col53 AS value,"col53" AS col_name),
STRUCT(col54 AS value,"col54" AS col_name),
STRUCT(col55 AS value,"col55" AS col_name),
STRUCT(col56 AS value,"col56" AS col_name),
STRUCT(col57 AS value,"col57" AS col_name),
STRUCT(col58 AS value,"col58" AS col_name),
STRUCT(col59 AS value,"col59" AS col_name),
STRUCT(col60 AS value,"col60" AS col_name),
STRUCT(col61 AS value,"col61" AS col_name),
STRUCT(col62 AS value,"col62" AS col_name),
STRUCT(col63 AS value,"col63" AS col_name),
STRUCT(col64 AS value,"col64" AS col_name),
STRUCT(col65 AS value,"col65" AS col_name),
STRUCT(col66 AS value,"col66" AS col_name),
STRUCT(col67 AS value,"col67" AS col_name),
STRUCT(col68 AS value,"col68" AS col_name),
STRUCT(col69 AS value,"col69" AS col_name),
STRUCT(col70 AS value,"col70" AS col_name),
STRUCT(col71 AS value,"col71" AS col_name),
STRUCT(col72 AS value,"col72" AS col_name),
STRUCT(col73 AS value,"col73" AS col_name),
STRUCT(col74 AS value,"col74" AS col_name),
STRUCT(col75 AS value,"col75" AS col_name),
STRUCT(col76 AS value,"col76" AS col_name),
STRUCT(col77 AS value,"col77" AS col_name),
STRUCT(col78 AS value,"col78" AS col_name),
STRUCT(col79 AS value,"col79" AS col_name),
STRUCT(col80 AS value,"col80" AS col_name),
STRUCT(col81 AS value,"col81" AS col_name),
STRUCT(col82 AS value,"col82" AS col_name),
STRUCT(col83 AS value,"col83" AS col_name),
STRUCT(col84 AS value,"col84" AS col_name),
STRUCT(col85 AS value,"col85" AS col_name),
STRUCT(col86 AS value,"col86" AS col_name),
STRUCT(col87 AS value,"col87" AS col_name),
STRUCT(col88 AS value,"col88" AS col_name),
STRUCT(col89 AS value,"col89" AS col_name),
STRUCT(col90 AS value,"col90" AS col_name),
STRUCT(col91 AS value,"col91" AS col_name),
STRUCT(col92 AS value,"col92" AS col_name),
STRUCT(col93 AS value,"col93" AS col_name),
STRUCT(col94 AS value,"col94" AS col_name),
STRUCT(col95 AS value,"col95" AS col_name),
STRUCT(col96 AS value,"col96" AS col_name),
STRUCT(col97 AS value,"col97" AS col_name),
STRUCT(col98 AS value,"col98" AS col_name),
STRUCT(col99 AS value,"col99" AS col_name),
STRUCT(col100 AS value,"col100" AS col_name)
] AS col_val_arr
FROM
  sample
),
melted AS (
SELECT
  id,
  col_val_arr.col_name,
  col_val_arr.value
FROM
  melt_arr,melt_arr.col_val_arr AS col_val_arr
),
_corr AS (
#相関を計算する
SELECT
  t1.col_name AS col_name1,
  t2.col_name AS col_name2,
  CORR(t1.value,t2.value)
FROM
  melted AS t1
INNER JOIN
  melted AS t2
USING
  (id)
WHERE
  t1.col_name > t2.col_name
GROUP BY
  1,2
)
SELECT
  *
FROM
  _corr

解説

sampleデータの生成

spread sheetを使って連番で各行を生成しました。

melt_arr

col1 ~ col100を
ARRAY<STRUCT<value FLOAT64,col_name STRING>>
形式の1列に変換します。
なおこういった変換の際にはマルチカーソルが便利です。

BigQueryのエディタで、altを押しながら上下にドラッグをするとカーソルを複数出すことが出来ます。
BigQueryでマルチカーソルを使う - アクセス解析担当者のBigQuery日記

イメージ

f:id:sem-aa-bq:20190422234017g:plain
マルチカーソル

行っている処理は
1. まずcol1 ~ col100を貼り付けます。
2. altを押しながらドラッグでマルチカーソルを出します。
3. STRUCTまで書きます
4. colxxを各カーソルで選択し、コピーします
5. AS value," まで書きます
6. ペーストします。すると4でコピーした、行ごとのcolxxを貼り付けることが出来ます
7. " AS col_name), まで書きます。
8. エスケープキーを押して、マルチカーソルから抜け出し、col100の行だけカンマが不要なので削除します

BigQueryに限らず大体のエディタで使えます。

melted

melt_arrをUNNESTしています。
このサブクエリ、アウトプットのイメージで書いたようにデータが縦持ちになっています。

_corr

各列の相関を計算しています。

所感

いつか使う時が来る気がする。
以上です。