Processing access logs with AWK - 堕落した人生を目指す日記

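When I want to look into specific access patterns in an access log I process it with AWK, and every so often I want to break the numbers down by mobile carrier. I could sort requests by User-Agent, but some of them are spoofed. Matching against the IP ranges that the carriers publish is more reliable, and for that it would probably be quicker to dash something off in Ruby or Perl. For a quick look, though, awk is easier. Besides, the servers I end up investigating often don't have Ruby installed, and I can't go installing Ruby or Perl modules on someone else's server. Yet when I ask Google, I don't see anyone doing IP address processing in awk.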

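So I want to do the whole thing with awk (gawk) alone, since it is installed everywhere.

So I wrote it. The requirements were:

- Use only awk (gawk)
- Do the matching with bit operations and comparisons only (otherwise it can't be fed gigabyte-scale logs)
- Keep the mobile carrier IP ranges in separate files
- Aggregate per carrier, per method, in 5-minute buckets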

#!/bin/awk -f
# Requires gawk: lshift(), rshift() and asorti() are gawk extensions.

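# Convert a dotted-quad IPv4 address into a 32-bit integer.
# "ip" is declared as an extra parameter so that it is local to the function.
function ip2bin(address,    ip) {
  split(address, ip, ".")
  return lshift(ip[1],24) + lshift(ip[2],16) + lshift(ip[3],8) + ip[4]
}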

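# Clear the host bits of a numeric address and return the network part
# as an "H"-prefixed hex string, so it is always compared as a string.
function ip2mask(address, mask) {
  return sprintf("H%x", lshift(rshift(address, 32 - mask), 32 - mask))
}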

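# Return 1 if ipaddr falls inside any "network:prefix" entry of iparray.
function is_match(ipaddr, iparray,    addr, key, a) {
  addr = ip2bin(ipaddr)            # convert once, not once per range
  for (key in iparray) {
    split(iparray[key], a, ":")    # a[1] = masked network, a[2] = prefix length
    if (ip2mask(addr, a[2]) == a[1]) return 1
  }
  return 0
}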

# Classify an address by carrier; anything unmatched is counted as "PC".
function is_mobile(ipaddr,    carrier) {
  carrier = "PC"
  if (is_match(ipaddr, ip_docomo))   {carrier = "DoCoMo"}
  if (is_match(ipaddr, ip_kddiau))   {carrier = "KDDIAU"}
  if (is_match(ipaddr, ip_softbank)) {carrier = "SoftBank"}
  if (is_match(ipaddr, ip_emobile))  {carrier = "Emobile"}
  return carrier
}

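# Build the 5-minute bucket key: hour plus minute rounded down to a
# multiple of 5, prefixed with "T" so the key stays a string.
function get_asso_time(time,    time_a, min) {
  split(time, time_a, ":")
  min = time_a[3] - time_a[3] % 5
  return sprintf("T%s%02d",time_a[2],min)
}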

# Return the first word of the request field, i.e. the HTTP method.
function get_method(method,    m_a) {
  split(method, m_a, " ")
  return m_a[1]
}

BEGIN {
  # Load each carrier's range file ("a.b.c.d/len" per line) as "network:len".
  # The getline call is parenthesized so that "> 0" is unambiguously a comparison.
  FS = "/"
  while ((getline < "mobile_docomo.lst")   > 0) ip_docomo[++n]   = ip2mask(ip2bin($1),$2)":"$2
  while ((getline < "mobile_kddiau.lst")   > 0) ip_kddiau[++n]   = ip2mask(ip2bin($1),$2)":"$2
  while ((getline < "mobile_softbank.lst") > 0) ip_softbank[++n] = ip2mask(ip2bin($1),$2)":"$2
  while ((getline < "mobile_emnet.lst")    > 0) ip_emobile[++n]  = ip2mask(ip2bin($1),$2)":"$2
  FS = " "
  car[1] = "PC"
  car[2] = "DoCoMo"
  car[3] = "KDDIAU"
  car[4] = "SoftBank"
  car[5] = "Emobile"
  met[1] = "GET"
  met[2] = "POST"
  met[3] = "HEAD"

  printf("T[Time]\tAll")
  for (c = 1; c <= 5; c++) {
    for (m =1; m <= 2; m++) {
      printf("\t%s %s",car[c],met[m])
    }
  }
  printf("\n")
}

{
  gsub(/"/,"",$6)

  time = get_asso_time($4)
  mobile = is_mobile($1)
  method = get_method($6)

  count[time] += 1

  if (count[time] == 1) {
    for (c in car){
      for (m in met) {
        carmet_count[time,car[c],met[m]] = 0
      }
      car_count[time,car[c]] = 0
    }
  }

  car_count[time,mobile] += 1
  carmet_count[time,mobile,method] += 1
}

END {
  n = asorti(count,dest)
  for (i = 1; i <= n; i++){
    printf("%s\t%d", dest[i],count[dest[i]])
    for (c = 1; c <= 5; c++) {
      for (m =1; m <= 2; m++) {
        #printf("\t%s %s : %d\n",car[c],met[m], carmet_count[dest[i],car[c],met[m]])
        printf("\t%d",carmet_count[dest[i],car[c],met[m]])
      }
    }
    printf("\n")
  }
}
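
The 5-minute key that get_asso_time builds can be tried on its own. What follows is only a sketch: it assumes an Apache-style timestamp field such as [10/Oct/2000:13:57:36, and the sample values and the shell function name bucket_of are made up for illustration.

```shell
# bucket_of TIMESTAMP: print the 5-minute bucket key for one timestamp field.
# (Hypothetical helper; it repeats the arithmetic of get_asso_time.)
bucket_of() {
  awk -v ts="$1" '
    BEGIN {
      split(ts, t, ":")              # t[2] = hour, t[3] = minute
      min = t[3] - t[3] % 5          # round down to a multiple of 5
      printf "T%s%02d\n", t[2], min
    }'
}

bucket_of '[10/Oct/2000:13:57:36'    # prints T1355
bucket_of '[10/Oct/2000:09:04:59'    # prints T0900
```

The key carries no date, so lines from different days at the same time of day end up in the same bucket.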

The mobile carrier IP range files are prepared roughly as follows. This one is for DoCoMo.

210.153.84.0/24
210.136.161.0/24
210.153.86.0/24
124.146.174.0/24
124.146.175.0/24
202.229.176.0/24
202.229.177.0/24
202.229.178.0/24
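
To make the matching step concrete, here is a small sketch that checks one address against one line of such a file. It swaps gawk's lshift/rshift for plain arithmetic (multiply, divide, int) so that it runs under any POSIX awk. That substitution, the shell function name in_range and the test addresses are assumptions for illustration, not part of the script above.

```shell
# in_range ADDRESS CIDR: print "match" or "no match" (hypothetical helper).
in_range() {
  awk -v addr="$1" -v cidr="$2" '
    # Dotted quad -> number; same value as the shift-and-add in ip2bin.
    function to_num(s,    p) {
      split(s, p, ".")
      return ((p[1] * 256 + p[2]) * 256 + p[3]) * 256 + p[4]
    }
    # Drop the host bits: divide by 2^(32-len), truncate, multiply back.
    function network(n, len,    size) {
      size = 2 ^ (32 - len)
      return int(n / size) * size
    }
    BEGIN {
      split(cidr, c, "/")            # c[1] = network address, c[2] = prefix length
      if (network(to_num(addr), c[2]) == network(to_num(c[1]), c[2]))
        print "match"
      else
        print "no match"
    }'
}

in_range 210.153.84.17 210.153.84.0/24    # prints "match"
in_range 210.153.85.17 210.153.84.0/24    # prints "no match"
```

All values stay below 2^32, so ordinary awk numbers (doubles) represent them exactly.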

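Notes

- I'm still not satisfied with things like variable initialization, but I got it working for now
- I don't really understand variable scope either
- Large numbers end up being stored as floating point, and they came out wrong after I put them into an array and took them back out
- Numeric keys in associative arrays also seemed to misbehave, so I worked around it by prefixing them with H or T
- It turns out associative arrays don't preserve order when you iterate over them with for
- I tried out one gawk extension: asorti
- Once again I realized how convenient Ruby is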