Pylogsparser: a use case, analysing SSH attacks

In this article we will see how easy it is to use the pylogsparser library through a simple use case. It should help you start working on your own project involving log analysis.

The problem

Here at Wallix we have mostly switched to VPN links to access local resources from the outside world, and all inbound SSH traffic is redirected to a small, otherwise unused server, completely isolated in our firewall’s DMZ. We could have simply shut down the service, but it is more interesting to keep it up as a kind of “honeypot”, giving us insight into what happens to any machine exposed to the Internet. We therefore need to extract some interesting data from our connection logs, such as where attacks come from and which accounts are attacked most often; the pylogsparser library will help us do just that in a few lines of code.

The solution

We will use a few Python libraries to tackle our problem:

  • the pylogsparser library, obviously, will be used to parse the SSH logs;
  • the matplotlib library will be used to plot pie charts related to our findings;
  • the GeoIP library will be used to translate incoming IPs into countries of origin;
  • the numpy library will be used once to define a pretty color map for our pie charts.

All these libraries can be installed through PyPI with easy_install, and most of them are packaged for your favorite OS.

Since pylogsparser includes an SSH log normalizer, using it in our case is pretty straightforward: all we have to do is instantiate a LogNormalizer object by giving it the default normalizers path.

from logsparser.lognormalizer import LogNormalizer as LN
 
normalizer = LN('/usr/share/normalizers')
auth_logs = open('/var/log/auth.log', 'r')
 
line = next(auth_logs).rstrip('\n') # grab the first log line, remove the trailing \n
log = {'raw' : line } # a LogNormalizer expects input as a dictionary
normalizer.normalize(log)

And that’s it! Our “log” dictionary now contains extra metadata that we can exploit.

If you look at the SSH normalizer’s documentation, you can see that SSH logs notifying of a connection failure are tagged with these fields:

  • “action” will be set to “fail”
  • “user” will be set to the account targeted by the failed connection attempt
  • “source_ip” will be set to the incoming IP of the connection attempt
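
Testing these three tags is all the filtering our scripts need. Here is a minimal sketch of that check, using hand-written dictionaries as stand-ins for the normalizer's output; the helper name is_failed_login and the sample values are illustrative assumptions, not part of pylogsparser:

```python
def is_failed_login(log):
    """Return True if a normalized log dictionary describes a failed login.

    Only the "action", "user" and "source_ip" tags listed above are used;
    .get() keeps the check safe for logs that lack them.
    """
    return (log.get('action') == 'fail'
            and log.get('user') is not None
            and log.get('source_ip') is not None)


# Hand-written stand-ins for normalized logs (assumed values, documentation IP).
failed = {'raw': 'sample line', 'action': 'fail',
          'user': 'admin', 'source_ip': '192.0.2.10'}
untagged = {'raw': 'another sample line'}  # a line that carries no extra tags

print(is_failed_login(failed))
print(is_failed_login(untagged))
```

Using .get() rather than direct indexing matters because a log file such as auth.log mixes many kinds of lines, and most of them will not carry these tags.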

So now, let’s find out more about the connection attempts by classifying the failures by country of origin.

from logsparser.lognormalizer import LogNormalizer as LN
import GeoIP
 
normalizer = LN('/usr/share/normalizers')
locator = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
 
LOGFILE = '/var/log/auth.log'
 
auth_logs = open(LOGFILE, 'r')
 
attacks = {}
origin_unknown = {}
 
print "Parsing the log file ... "
 
for log in auth_logs:
    l = {"raw" : log }
    normalizer.normalize(l)
    if l.get('action') == 'fail':
        country = locator.country_name_by_addr(l['source_ip'])
        if country:
            attacks[country] = attacks.get(country, 0) + 1
        else:
            origin_unknown[l['source_ip']] =  origin_unknown.get(l['source_ip'], 0) + 1
 
print "Done, %i attacks with a known origin found:" % sum(attacks.values())
for country, count in sorted(attacks.items(), key = lambda item: item[1]): # sort by ascending number of attempts
    print "\t%s (%i attempts)" % (country, count)
if origin_unknown:
    print "The following IPs could not be found in the GeoIP database:"
    for ip, count in sorted(origin_unknown.items(), key = lambda item: item[1]):
        print "\t%s (%i attempts)" % (ip, count)

This is nice, but images have more impact than words, so let’s plot a pie chart of the results we just found.
Additionally, I want to know which accounts are targeted by the attacks. Here is the complete script:

from logsparser.lognormalizer import LogNormalizer as LN
import matplotlib.pyplot as plt
import numpy as np
import GeoIP
 
normalizer = LN('/usr/share/normalizers')
locator = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
color_map = plt.cm.rainbow
 
LOGFILE = '/var/log/auth.log'
 
auth_logs = open(LOGFILE, 'r')
 
attacks = {}
users = {}
origin_unknown = {}
 
print "Parsing the log file ... "
 
for log in auth_logs:
    l = {"raw" : log }
    normalizer.normalize(l)
    if l.get('action') == 'fail' and l.get('program') == 'sshd':
        if l.get('user') and l.get('user') not in ['root']: # root is used in more than 80% of the attacks, let's see the rest instead
            users[l['user']] = users.get(l['user'], 0) + 1
        country = locator.country_name_by_addr(l['source_ip'])
        if country:
            attacks[country] = attacks.get(country, 0) + 1
        else:
            origin_unknown[l['source_ip']] =  origin_unknown.get(l['source_ip'], 0) + 1
 
print "Done, %i attacks with a known origin found." % sum(attacks.values())
if origin_unknown:
    print "The following IPs could not be found in the GeoIP database:"
    for ip, count in sorted(origin_unknown.items(), key = lambda item: item[1]):
        print "\t%s (%i attempts)" % (ip, count)
 
fig = plt.figure(figsize = (20,10), dpi = 80)
ax1 = fig.add_subplot(121)
to_draw = sorted(attacks.items(), key = lambda item: item[0]) # sort by country name
labels, values = zip(*to_draw)
ax1.pie(values,
       labels = labels,
       colors = color_map(np.linspace(0,1,len(values))),
       shadow = True,
       explode = [ 0.01 + 0.02 * (i % 3) for i in range(len(values)) ]
       )
ax1.set_title('Origins of SSH break-in attempts')
 
ax2 = fig.add_subplot(122)
top_users = sorted(users.items(), key = lambda item: item[1], reverse = True)[:15] # the 15 most attacked accounts
labels, values = zip(*top_users)
ax2.pie(values,
       labels = labels,
       colors = color_map(np.linspace(0,1,len(values))),
       shadow = True,
       explode = [ 0.01 + 0.02 * (i % 3) for i in range(len(values)) ]
       )
ax2.set_title('Top 15 Attacked Accounts (not including root)')
fig.savefig("attacks.png", dpi = fig.dpi)

Run the script and you should have a picture called “attacks.png” in your current folder if all goes well. Here is what we got from our “honeypot”, using last month’s logs.

Analysis

These graphs are interesting in many ways, and even surprising in some respects:

  • The “usual suspects” are present: China, Russia, USA, Korea … But also the Netherlands and Spain! Could there be a rise of hackers in these countries? Or maybe botnets, or compromised hosts? The absence of Germany is also quite a surprise.
  • The root account was targeted in more than 80% of the logged attacks. It is vital (and very easy, see a previous article on this blog) not to allow root access through SSH!
  • Some targeted accounts are not really surprising: admin, test, guest, git … even “a”, for the lazy admin …
  • … but some others are intriguing: marine? 25? PlcmSplp looks like a random account at first, but since it is the third most targeted account, there must be a logical explanation for it. And indeed there is: a quick Google search shows that PlcmSplp is a user created by software called SIPX, whose default password seems to be well known. This is therefore a possible vulnerability on a system running it. Note that by observing such trends, you could become aware of vulnerabilities that are being exploited before they are publicly known (such a vulnerability is called a “0-day”); this is why honeypots are crucial for security researchers.
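
The “more than 80%” figure is simply root's share of all failed attempts. A minimal sketch of that computation follows, assuming the per-account counts include root (unlike the users dictionary in the script above, which skips it). The helper name account_shares and the counts are made up for illustration; they are not our honeypot's data:

```python
from collections import Counter


def account_shares(attempts):
    """Map each account name to its percentage of all failed attempts."""
    total = sum(attempts.values())
    if total == 0:
        return {}
    return dict((user, 100.0 * count / total)
                for user, count in attempts.items())


# Made-up counts, for illustration only.
attempts = Counter({'root': 85, 'admin': 10, 'test': 5})
shares = account_shares(attempts)

# most_common() yields the accounts from most to least attacked.
for user, count in attempts.most_common():
    print("%-6s %5.1f%%" % (user, shares[user]))
```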

As for pylogsparser itself, this example shows how easy it makes extracting relevant data from logs. One might object that we could have done without it by using a regular expression to parse the logs, but this approach has a few disadvantages:

  • why spend time on something that a library already does?
  • writing the regular expression is not always easy, takes time and requires knowledge of the log format
  • the regular expression approach makes your script harder to extend: what if you want to add Apache logs, for example? With pylogsparser, it is a simple matter of writing a new condition on metadata.
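
For comparison, here is a sketch of what the hand-written approach might look like. It assumes the usual OpenSSH “Failed password for …” message in a standard syslog line; the pattern, the helper name parse_failed_password and the sample line are illustrative assumptions, not code we use:

```python
import re

# Matches OpenSSH "Failed password" messages only; other failure messages
# and other daemons would each need their own pattern.
FAILED_PASSWORD = re.compile(
    r'sshd\[\d+\]: Failed password for (?:invalid user )?(?P<user>\S+) '
    r'from (?P<source_ip>\S+) port \d+'
)


def parse_failed_password(line):
    """Return a dict with 'user' and 'source_ip', or None if no match."""
    match = FAILED_PASSWORD.search(line)
    return match.groupdict() if match else None


# Hypothetical log line using a documentation IP address.
sample = ('Jul 18 08:55:35 myhost sshd[3245]: Failed password for invalid '
          'user admin from 192.0.2.10 port 4242 ssh2')
print(parse_failed_password(sample))
```

Even this small pattern has to know about the optional “invalid user” prefix, and it silently ignores every other kind of failure message, which illustrates the second and third points above.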

We hope this article has increased your interest in pylogsparser and that it will help you get started with your own log analysis project. There are many things that could be done: turn your logs into a CSV file that can be loaded into an ETL or a database, send a mail notification when a certain condition on metadata is met, make a timelapse animation of the origin of SSH attacks on a world map … So start coding and happy log digging! Please tell us what you do with pylogsparser!
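
As a possible starting point for the CSV idea, here is a minimal sketch that turns normalized log dictionaries into CSV rows. The dictionaries are hand-written stand-ins for normalizer output, and the helper name to_rows and the choice of columns are assumptions made for illustration:

```python
import csv
import sys

# Tags to export, in column order (an arbitrary choice for this sketch).
FIELDS = ['source_ip', 'user', 'action']


def to_rows(logs, fields=FIELDS):
    """Build a header row followed by one row per normalized log.

    Tags missing from a log are exported as empty strings.
    """
    rows = [list(fields)]
    for log in logs:
        rows.append([log.get(name, '') for name in fields])
    return rows


# Hand-written stand-ins for normalized logs (assumed values).
sample_logs = [
    {'raw': 'sample line', 'action': 'fail',
     'user': 'admin', 'source_ip': '192.0.2.10'},
    {'raw': 'sample line', 'action': 'fail', 'source_ip': '198.51.100.7'},
]

rows = to_rows(sample_logs)
csv.writer(sys.stdout).writerows(rows)
```

Writing to sys.stdout keeps the sketch short; redirecting the output to a file gives something a database or ETL tool can load.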

Article contributed by Matthieu Huin, R&D engineer in Wallix LogBox development team.
