April 10, 2018

Cleaning up messy distributed logs, part 2

This is a story on sorting logs. First part can be found here, it's where you can get all the nitty gritty. If you are not feeling like it here is a short elevator pitch. Our kibana logs are a mess, each of the services seems to have different log format and trace backs are split across many entries. My goal is to fix it.

Second part is much much shorter as it's one of the steps I took on path to sorting this pet peeve of mine. Unfortunately, I can not spend my whole time at work to fix it as it is not a crucial part of our application. At least it is not at this moment.

My first idea was to tackle "easy" issue of grok parsing failure. Our configuration is rather bloated IMO so I trimmed it (I had mentioned that it is not my concern but it easier to reason about simple code). From this:

      input {
        gelf { }
        heartbeat { }
        syslog {
          port => 5145
        }
      }
      filter {
        ruby {
          code => "
            event.to_hash.keys.each { |k| event[ k.gsub('"'.'"','"'_'"') ] = 
            event.remove(k) if k.include?'"'.'"' }
          "
        }
        grok {
          match => {
            "message" => [
              # Logspout
              "%{SYSLOG5424PRI}%{NONNEGINT:ver} +(?:%{TIMESTAMP_ISO8601:ts}|-) 
              +(?:%{HOSTNAME:containerid}|-) 
              +(?:%{WORD:process}\.%{INT:processnumber}\.%{WORD:processsha}|%{NOTSPACE:process}) 
              +(?:%{NOTSPACE:proc}|-) +(?:%{WORD:msgid}|-) 
              +(?:%{SYSLOG5424SD:sd}|-|) +%{GREEDYDATA:msg}"
            ]
          }
        }
      }
      output {
        elasticsearch {
          hosts => ["elasticsearch:9200"]
        }
        stdout {
          codec => rubydebug
        }
      }

I went to this:

     input {
      syslog { }
      }
      filter {
        grok {
          match => {
            "message" => [
              # Logspout
              "%{SYSLOG5424PRI}%{NONNEGINT:ver} +(?:%{TIMESTAMP_ISO8601:ts}|-) 
              +(?:%{HOSTNAME:containerid}|-) 
              +(?:%{WORD:process}\.%{INT:processnumber}\.%{WORD:processsha}|%{NOTSPACE:process}) 
              +(?:%{NOTSPACE:proc}|-) +(?:%{WORD:msgid}|-) 
              +(?:%{SYSLOG5424SD:sd}|-|) +%{GREEDYDATA:msg}"
            ]
          }
        }
      }
      output {
        elasticsearch {
          hosts => ["elasticsearch:9200"]
        }
      }

We had few plugins that were doing nothing so went under the cleaver. Since I can run the swarm locally I'd verify that all is logged and displayed correctly in kibana.

So then I did what everyone does, google it hard. Actually duckduckgo it but it just does not sound right. After lot of reading finally found something relevant to my case. So the perpetrator is syslog plugin, I mentioned before that logs format might not be consistent and possibly (for sure TBH) not up to syslog's spec. As author recommended I had replaced syslog { } with tcp and udp plugins.

     input {
          tcp {
              port => 5145
              type => syslog
          }
          udp {
              port => 5145
              type => syslog
          }
      }

Logs are now without _grokparsefailure tags. Still bit messy but at least correctly and efficiently parsed. See you in the next part.

Tags: kibana elk logstash microservices