In this article we will see how easy it is to use the pylogsparser library through a simple use case. It should help you start working on your own project involving log analysis.
The problem
Here at Wallix we have mostly switched to VPN links for accessing local resources from the outside world, and all inbound SSH traffic is redirected to a small, otherwise unused server, completely isolated in our firewall’s DMZ. We could have simply shut down the service, but it is more interesting to keep it up as a kind of “honeypot”, giving us insight into what happens to any machine exposed to the Internet. We therefore need to extract some interesting data from our connection logs, such as where attacks come from and which accounts are attacked most; the pylogsparser library will help us do just that in a few lines of code.
The solution
We will use a few Python libraries to tackle our problem:
- the pylogsparser library, obviously, will be used to parse the SSH logs;
- the matplotlib library will be used to plot pie charts related to our findings;
- the GeoIP library will be used to translate incoming IPs into countries of origin;
- the numpy library will be used once to define a pretty color map for our pie charts.
All these libraries can be installed through PyPI with easy_install, and most of them are packaged for your favorite OS.
Since pylogsparser includes an SSH log normalizer, using it in our case is pretty straightforward: all we have to do is instantiate a LogNormalizer object, giving it the default normalizers path.
    from logsparser.lognormalizer import LogNormalizer as LN

    normalizer = LN('/usr/share/normalizers')
    auth_logs = open('/var/log/auth.log', 'r')
    l = auth_logs.next()[:-1] # grab the first log line, remove the trailing \n
    log = {'raw' : l } # a LogNormalizer expects input as a dictionary
    normalizer.normalize(log)
And that’s it! Our “log” dictionary now contains extra metadata that we can exploit.
If you look at the SSH normalizer’s documentation, you can see that SSH log lines reporting a connection failure are tagged with these fields:
- “action” will be set to “fail”
- “user” will be set to the account for which the connection failed
- “source_ip” will be set to the incoming IP of the connection attempt
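To make these fields concrete, here is a minimal sketch of how a script might pick out the failures. It does not call pylogsparser: the dictionaries in sample_logs are hypothetical stand-ins for what normalizer.normalize() would fill in, with made-up values, and failed_attempts is an illustrative helper name, not part of the library. It should run unchanged on Python 2.7 and 3.

```python
# Hypothetical dictionaries standing in for normalized log lines.
# Only the field names ("action", "user", "source_ip", "program") come
# from the article; every value below is made up for illustration.
sample_logs = [
    {"raw": "(sshd failure line)", "program": "sshd", "action": "fail",
     "user": "admin", "source_ip": "192.0.2.10"},
    {"raw": "(cron line)", "program": "CRON"},  # no "action" tag at all
    {"raw": "(sshd failure line)", "program": "sshd", "action": "fail",
     "user": "root", "source_ip": "198.51.100.7"},
]


def failed_attempts(logs):
    """Yield (user, source_ip) for every dictionary tagged as a failure."""
    for log in logs:
        # .get() avoids a KeyError on lines that carry no "action" tag
        if log.get("action") == "fail":
            yield log.get("user"), log.get("source_ip")


failures = list(failed_attempts(sample_logs))
for user, ip in failures:
    print("%s from %s" % (user, ip))
```

Using .get() rather than direct indexing matters here because most lines in an auth log are not connection failures and will simply lack these tags.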
Now let’s find out more about the connection attempts by classifying the failures by country of origin.
    from logsparser.lognormalizer import LogNormalizer as LN
    import GeoIP

    normalizer = LN('/usr/share/normalizers')
    locator = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
    LOGFILE = '/var/log/auth.log'
    auth_logs = open(LOGFILE, 'r')
    attacks = {}
    origin_unknown = {}

    print "Parsing the log file ... "
    for log in auth_logs:
        l = {"raw" : log }
        normalizer.normalize(l)
        if l.get('action') == 'fail':
            country = locator.country_name_by_addr(l['source_ip'])
            if country:
                attacks[country] = attacks.get(country, 0) + 1
            else:
                origin_unknown[l['source_ip']] = origin_unknown.get(l['source_ip'], 0) + 1

    print "Done, %i attacks with a known origin found:" % sum(attacks.values())
    for i,j in sorted(attacks.items(), cmp = lambda a,b: cmp(a[1], b[1]) ): # This will sort the list by ascending attempts
        print "\t%s (%i attempts)" % (i,j)
    if origin_unknown:
        print "The following IP could not be found in the GeoIP database:"
        for i,j in sorted(origin_unknown.items(), cmp = lambda a,b: cmp(a[1], b[1]) ):
            print "\t%s (%i attempts)" % (i,j)
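The tally-and-sort step of the script above relies on the cmp argument of sorted(), which only exists in Python 2. As a hedged alternative, the same counting can be sketched with collections.Counter and a key function, which should behave the same on Python 2.7 and 3. The countries list is hypothetical data standing in for the GeoIP lookups; nothing here touches pylogsparser or GeoIP.

```python
from collections import Counter
from operator import itemgetter

# Hypothetical country names, one per failed attempt, standing in for
# the values returned by the GeoIP lookup in the script above.
countries = ["China", "Russia", "China", "Spain", "China", "Russia"]

# Counter replaces the "attacks.get(country, 0) + 1" idiom
attacks = Counter(countries)

# key=itemgetter(1) sorts on the count, ascending, like the cmp-based call
ranked = sorted(attacks.items(), key=itemgetter(1))
for country, attempts in ranked:
    print("\t%s (%i attempts)" % (country, attempts))
```

Counter also offers most_common(n), which returns the n largest counts in descending order and can be handy for "top N" lists.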
This is nice, but images have more impact than words, so let’s plot a pie chart of the results we just obtained.
Additionally, I also want to know which accounts are targeted by the attacks. Here is the full script:
    from logsparser.lognormalizer import LogNormalizer as LN
    import matplotlib.pyplot as plt
    import numpy as np
    import GeoIP

    normalizer = LN('/usr/share/normalizers')
    locator = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
    color_map = plt.cm.rainbow
    LOGFILE = '/var/log/auth.log'
    auth_logs = open(LOGFILE, 'r')
    attacks = {}
    users = {}
    origin_unknown = {}

    print "Parsing the log file ... "
    for log in auth_logs:
        l = {"raw" : log }
        normalizer.normalize(l)
        if l.get('action') == 'fail' and l.get('program') == 'sshd':
            if l.get('user') and l.get('user') not in ['root']:
                # root is used in more than 80% of the attacks, let's see the rest instead
                users[l['user']] = users.get(l['user'], 0) + 1
            country = locator.country_name_by_addr(l['source_ip'])
            if country:
                attacks[country] = attacks.get(country, 0) + 1
            else:
                origin_unknown[l['source_ip']] = origin_unknown.get(l['source_ip'], 0) + 1

    print "Done, %i attacks with a known origin found." % sum(attacks.values())
    if origin_unknown:
        print "The following IP could not be found in the GeoIP database:"
        for i,j in sorted(origin_unknown.items(), cmp = lambda a,b: cmp(a[1], b[1]) ):
            print "\t%s (%i attempts)" % (i,j)

    fig = plt.figure(figsize = (20,10), dpi = 80)

    # Left pie: attempts per country, sorted by country name
    ax1 = fig.add_subplot(121)
    to_draw = sorted(attacks.items(), cmp = lambda a,b: cmp(a[0], b[0]) )
    labels, values = zip(*to_draw)
    ax1.pie(values,
            labels = labels,
            colors = color_map(np.linspace(0,1,len(values))),
            shadow = True,
            explode = [ 0.01 + 0.02 * (i % 3) for i in range(len(values)) ] )
    ax1.set_title('Origins of SSH break-in attempts')

    # Right pie: the 15 most attacked accounts, by descending attempts
    ax2 = fig.add_subplot(122)
    top_users = sorted(users.items(), cmp = lambda a,b: cmp(a[1], b[1]), reverse = True)[:15]
    labels, values = zip(*top_users)
    ax2.pie(values,
            labels = labels,
            colors = color_map(np.linspace(0,1,len(values))),
            shadow = True,
            explode = [ 0.01 + 0.02 * (i % 3) for i in range(len(values)) ] )
    ax2.set_title('Top 15 Attacked Accounts (not including root)')

    fig.savefig("attacks.png", dpi = fig.dpi)
Run the script and you should have a picture called “attacks.png” in your current folder if all goes well. Here is what we got from our “honeypot”, using last month’s logs.
Analysis
These graphs are interesting in many ways, and even surprising in some respects:
- The “usual suspects” are present: China, Russia, USA, Korea… but also the Netherlands and Spain! Could there be a rise of hackers in these countries? Or maybe botnets, or compromised hosts? The absence of Germany is also quite a surprise.
- The root account was targeted in more than 80% of the logged attacks. It is vital (and very easy, see a previous article on this blog) not to allow root access through SSH!
- Some targeted accounts are not really surprising: admin, test, guest, git… even “a”, for the lazy admin…
- … but some others are intriguing: marine? 25? PlcmSplp looks like a random account at first, but since it is the 3rd most targeted account, there must be a logical explanation for it. And indeed there is: a quick Google search shows that PlcmSplp is a user created by a piece of software called SIPX, whose default password seems to be well known. This is therefore a possible vulnerability on a system. Note that by observing these trends you could become aware of vulnerabilities that are being exploited but are not yet publicly known (such a vulnerability is called a “0-day”); this is why honeypots are crucial for security researchers.
As for pylogsparser itself, this example shows how easy it is to extract relevant data from logs with it. One might object that we could have done without it by using a regular expression to parse the logs, but this approach has a few disadvantages:
- why spend time on something that a library already does?
- writing the regular expression is not always easy: it takes time and requires knowledge of the log format
- the regular expression approach makes your script harder to extend: what if you want to add Apache logs, for example? With pylogsparser, it is a simple matter of writing a new condition on the metadata.
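For comparison, a hand-rolled version of the parsing might look like the sketch below. It is only an illustration under stated assumptions: the pattern targets the common OpenSSH “Failed password for …” message, the sample lines are made up, and parse_failure is a hypothetical helper name. It should run on Python 2.7 and 3.

```python
import re

# Assumed message shape: "sshd[PID]: Failed password for [invalid user ]USER
# from IP port N ...". Other failure messages would need their own patterns.
FAILED_PASSWORD = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<source_ip>\S+) port \d+"
)


def parse_failure(line):
    """Return {'user': ..., 'source_ip': ...} for a failure line, else None."""
    match = FAILED_PASSWORD.search(line)
    if match is None:
        return None
    return match.groupdict()


# Made-up sample lines for illustration
sample_lines = [
    "Sep 28 10:15:02 honeypot sshd[1234]: Failed password for invalid user "
    "admin from 192.0.2.10 port 51234 ssh2",
    "Sep 28 10:15:07 honeypot sshd[1236]: Failed password for root "
    "from 198.51.100.7 port 40022 ssh2",
    "Sep 28 10:17:01 honeypot CRON[1300]: pam_unix(cron:session): "
    "session opened for user root by (uid=0)",
]

parsed = [parse_failure(line) for line in sample_lines]
for result in parsed:
    print(result)
```

Even this small pattern encodes several assumptions about the message layout, and every additional log source would need its own expression, which is the maintenance burden the list above describes.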
We hope this article has increased your interest in pylogsparser and that it will help you get started with your own log analysis project. There are many things that could be done: turn your logs into a CSV file that can be loaded into an ETL or a database, send a mail notification when a certain condition on the metadata is met, make a time-lapse animation of the origin of SSH attacks on a world map… So start coding, and happy log digging! Please tell us what you do with pylogsparser!
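As a starting point for the first of these ideas, here is a minimal sketch of a CSV export. It targets Python 3 and uses only the standard csv module; the normalized_logs dictionaries are hypothetical stand-ins for normalizer output with made-up values, and logs_to_csv is an illustrative name rather than anything pylogsparser provides.

```python
import csv
import io

# Hypothetical normalized dictionaries; only the field names come from
# the article, the values are made up for illustration.
normalized_logs = [
    {"raw": "(first raw line)", "action": "fail",
     "user": "admin", "source_ip": "192.0.2.10"},
    {"raw": "(second raw line)", "action": "fail",
     "user": "test", "source_ip": "198.51.100.7"},
]


def logs_to_csv(logs, fields):
    """Serialize the chosen fields of each dictionary as CSV text."""
    buffer = io.StringIO()
    # extrasaction="ignore" silently drops keys not listed in fields
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for log in logs:
        writer.writerow(log)
    return buffer.getvalue()


csv_text = logs_to_csv(normalized_logs, ["user", "source_ip", "action"])
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; a real script would more likely pass an open file object to csv.DictWriter.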
Article contributed by Matthieu Huin, R&D engineer on the Wallix LogBox development team.

Looks really slick!
Could you do a small tutorial on the creation of a normaliser?
Thanks for such a great tool!
Hi David,
You read our minds! A tutorial is coming next week…
For the second script, shouldn’t the loop be over attacks.items() instead of origin_unknown.items()?
You are absolutely right, you spotted a bad case of unchecked copy-pasting.
The article will be corrected soon. Thanks for the tip!
Hi,
I am having trouble getting Pylogsparser to work. After installing Pylogsparser and trying this:
>>> from logsparser.lognormalizer import LogNormalizer as LN
I get the following errors in Python 2.7.2:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/logsparser/lognormalizer.py", line 35, in <module>
    from normalizer import Normalizer
  File "/usr/lib/python2.7/site-packages/logsparser/normalizer.py", line 35, in <module>
    from lxml.etree import parse, tostring
ImportError: No module named lxml.etree
>>>
Hi George,
It seems the lxml Python package is missing on your system.
You can install it with easy_install:
# easy_install lxml
You can also install the lxml package with the package management tool of your
Linux distribution. For example, on Debian:
# aptitude install python-lxml
Thanks Fbo. I installed lxml, but am now getting a different error:
———-
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    normalizer = LN('/usr/share/normalizers')
  File "/usr/lib/python2.7/site-packages/logsparser/lognormalizer.py", line 78, in __init__
    raise ValueError, "Invalid normalizer directory : %s" % norm_path
ValueError: Invalid normalizer directory : /usr/share/normalizers
———-
There is no /usr/share/normalizers directory on my system.
Hi George,
The installation path for normalizers changed recently, and you should now
find them in $prefix/share/logsparser/normalizers/ where $prefix is
either /usr/ or /usr/local/. So try instantiating LogNormalizer as follows:
normalizer = LN('/usr/share/logsparser/normalizers')
or
normalizer = LN('/usr/local/share/logsparser/normalizers')
normalizer = LN('/usr/share/logsparser/normalizers')
worked! thx