
Getting the Most Out of Your PHP Log Files: A Practical Guide

It could rightfully be said that logs are one of the most underestimated and underutilized tools at a PHP developer’s disposal. Despite the wealth of information they can offer, it is not uncommon for logs to be the last place a developer looks when trying to resolve a problem.

In truth, PHP log files should in many cases be the first place to look for clues when problems occur. Often, the information they contain could significantly reduce the amount of time spent pulling out your hair trying to track down a gnarly bug.

But perhaps even more importantly, with a bit of creativity and forethought, your log files can be leveraged to serve as a valuable source of usage information and analytics. Creative use of log files can help answer questions such as: What browsers are most commonly being used to visit my site? What’s the average response time from my server? What was the percentage of requests to the site root? How has usage changed since we deployed the latest updates? And much, much more.


This article provides a number of tips on how to configure your log files, as well as how to process the information that they contain, in order to maximize the benefit that they provide.

Although this article focuses on logging for PHP developers, much of the information presented herein is fairly “technology agnostic” and is relevant to other languages and technology stacks as well.

Note: This article presumes basic familiarity with the Unix shell. For those lacking this knowledge, an Appendix is provided that introduces some of the commands needed for accessing and reading log files on a Unix system.

Our PHP Log File Example Project

As an example project for discussion purposes in this article, we will take Symfony Standard as a working project and we’ll set it up on Debian 7 Wheezy with rsyslogd, nginx, and PHP-FPM.

composer create-project symfony/framework-standard-edition my "2.6.*"

This quickly gives us a working test project with a nice UI.

Tips for Configuring Your Log Files

Here are some pointers on how to configure your log files to help maximize their value.

Error Log Configuration

Error logs represent the most basic form of logging; i.e., capturing additional information and detail when problems occur. So, in an ideal world, you would want there to be no errors and for your error logs to be empty. But when problems do occur (as they invariably do), your error logs should be one of the first stops you make on your debugging trail.

Error logs are typically quite easy to configure.

For one thing, all error and crash messages can be logged in the error log in exactly the same format in which they would otherwise be presented to a user. With some simple configuration, the end user will never need to see those ugly error traces on your site, while devops will still be able to monitor the system and review these error messages in all their gory detail. Here’s how to set up this kind of logging in PHP:

log_errors = On
error_reporting = E_ALL
error_log = /path/to/my/error/log

Another two lines that are important to include in your PHP configuration for a live site, to preclude gory levels of error detail from being presented to users, are:

display_errors = Off
display_startup_errors = Off
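
As a quick sanity check, here’s a minimal PHP sketch (the message text is arbitrary) to confirm that errors are being captured in the configured file rather than displayed to the user:

<?php

// With log_errors = On and display_errors = Off, this warning is written
// to the file set in error_log instead of being rendered in the response.
trigger_error('Testing the error log', E_USER_WARNING);

Running this and then tailing the configured error log should show the new entry, complete with timestamp and severity.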

System Log (syslog) Configuration

There are many generally compatible implementations of the syslog daemon in the open source world including:

  • syslogd and sysklogd – most often seen on BSD family systems, CentOS, Mac OS X, and others
  • syslog-ng – default for modern Gentoo and SuSE builds
  • rsyslogd – widely used on the Debian and Fedora families of operating systems

(Note: In this article, we’ll be using rsyslogd for our examples.)

The basic syslog configuration is generally adequate for capturing your log messages in a system-wide log file (normally /var/log/syslog; might also be /var/log/messages or /var/log/system.log depending on the distro you’re using).

The system log provides several log facilities, eight of which (LOG_LOCAL0 through LOG_LOCAL7) are reserved for user-deployed projects. Here, for example, is how you might set up LOG_LOCAL0 to write to 4 separate log files, based on logging level (i.e., error, warning, info, debug):

# /etc/rsyslog.d/my.conf

local0.err      /var/log/my/err.log
local0.warning  /var/log/my/warning.log
local0.info     -/var/log/my/info.log
local0.debug    -/var/log/my/debug.log

Now, whenever you write a log message to LOG_LOCAL0 facility, the error messages will go to /var/log/my/err.log, warning messages will go to /var/log/my/warning.log, and so on. Note, though, that the syslog daemon filters messages for each file based on the rule of “this level and higher”. So, in the example above, all error messages will appear in all four configured files, warning messages will appear in all but the error log, info messages will appear in the info and debug logs, and debug messages will only go to debug.log.
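
To see this filtering in action, here’s a short PHP sketch (assuming the rsyslog rules above are in place) that writes one message at each level:

<?php

openlog('leveltest', LOG_ODELAY, LOG_LOCAL0);
syslog(LOG_ERR,     'An error');      // lands in all four files
syslog(LOG_WARNING, 'A warning');     // warning, info, and debug logs
syslog(LOG_INFO,    'An info note');  // info and debug logs only
syslog(LOG_DEBUG,   'A debug note');  // debug log only
closelog();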

One additional important note: the - signs before the info and debug level files in the above configuration example indicate that writes to those files should be performed asynchronously (making those operations non-blocking). This is typically fine (and even recommended in most situations) for info and debug logs, but it’s best to have writes to the error log (and most probably the warning log as well) be synchronous.

In order to shut down a less important level of logging (e.g., on a production server), you may simply redirect related messages to /dev/null (i.e., to nowhere):

local0.debug    /dev/null # -/var/log/my/debug.log

One specific customization that is useful, especially to support some of the PHP log file parsing we’ll be discussing later in this article, is to use tab as the delimiter character in log messages. This can easily be done by adding the following file in /etc/rsyslog.d:

# /etc/rsyslog.d/fixtab.conf

$EscapeControlCharactersOnReceive off

And finally, don’t forget to restart the syslog daemon after you make any configuration changes in order for them to take effect:

service rsyslog restart
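
Incidentally, rsyslogd can also validate its configuration without restarting, which is handy for catching typos in your rules first:

rsyslogd -N1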

Server Log Configuration

Unlike application logs and error logs that you can write to, server logs are exclusively written to by the corresponding server daemons (e.g., web server, database server, etc.) on each request. The only “control” you have over these logs is to the extent that the server allows you to configure its logging functionality. Though there can be a lot to sift through in these files, they are often the only way to get a clear sense of what’s going on “under the hood” with your server.

Let’s deploy our Symfony Standard example application in an nginx environment with a MySQL storage backend. Here’s the nginx host config we will be using:

server {
    server_name my.log-sandbox;
    root /var/www/my/web;

    location / {
        # try to serve file directly, fallback to app.php
        try_files $uri /app.php$is_args$args;
    }
    # DEV
    # This rule should only be placed on your development environment
    # In production, don't include this and don't deploy app_dev.php or config.php
    location ~ ^/(app_dev|config)\.php(/|$) {
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_split_path_info ^(.+\.php)(/.*)$;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param HTTPS off;
    }
    # PROD
    location ~ ^/app\.php(/|$) {
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_split_path_info ^(.+\.php)(/.*)$;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param HTTPS off;
        # Prevents URIs that include the front controller. This will 404:
        # http://domain.tld/app.php/some-path
        # Remove the internal directive to allow URIs like this
        internal;
    }

    error_log /var/log/nginx/my_error.log;
    access_log /var/log/nginx/my_access.log;
}

With regard to the last two directives above: access_log represents the general requests log, while error_log is for errors, and, as with application error logs, it’s worth setting up extra monitoring to be alerted to problems so you can react quickly.

Note: This is an intentionally oversimplified nginx config file that is provided for example purposes only. It pays almost no attention to security and performance and shouldn’t be used as-is in any “real” environment.

This is what we get in /var/log/nginx/my_access.log after typing http://my.log-sandbox/app_dev.php/ in the browser and hitting Enter:

192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /app_dev.php/ HTTP/1.1" 200 6715 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/framework/css/body.css HTTP/1.1" 200 6657 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/framework/css/structure.css HTTP/1.1" 200 1191 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/css/demo.css HTTP/1.1" 200 2204 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/images/welcome-quick-tour.gif HTTP/1.1" 200 4770 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/images/welcome-demo.gif HTTP/1.1" 200 4053 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/images/welcome-configure.gif HTTP/1.1" 200 3530 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /favicon.ico HTTP/1.1" 200 6518 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:30 +0300] "GET /app_dev.php/_wdt/e50d73 HTTP/1.1" 200 13265 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"

This shows that, to serve one page, the browser actually performs 9 HTTP calls. Seven of those are requests for static content, which are plain and lightweight, but they still consume network resources; this is what can be optimized using techniques such as sprites and minification.

While those optimizations are to be discussed in another article, what’s relevant here is that we can log requests for static content separately by using another location directive for them:

location ~ \.(jpg|jpeg|gif|png|ico|css|zip|tgz|gz|rar|bz2|pdf|txt|tar|wav|bmp|rtf|js)$ {
    access_log /var/log/nginx/my_access-static.log;
}

Remember that this nginx location directive performs simple regular expression matching, so you can include as many static content extensions as you expect to serve on your site.

Parsing such logs is no different than parsing application logs.
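
For example, here’s one way (a sketch, assuming the combined log format shown earlier, where the request path is the seventh whitespace-separated field) to tally static requests by file extension:

awk '{print $7}' /var/log/nginx/my_access-static.log | grep -oE '\.[a-z0-9]+$' | sort | uniq -c | sort -rn

This quickly shows which asset types dominate your static traffic, and thus where sprites or minification would pay off most.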

Other Logs Worth Mentioning

Two other server logs worth mentioning are the debug log and the data storage log.

The Debug Log

Another convenient thing about nginx logs is the debug log. We can turn it on by replacing the error_log line of the config with the following (this requires nginx to have been built with debug support, i.e., the --with-debug build option):

error_log /var/log/nginx/my_error.log debug;

Similar settings exist for Apache and most other web servers.

And incidentally, debug logs are not related to error logs, even though they are configured in the error_log directive.

Although the debug log can indeed be verbose (a single nginx request, for example, generated 127KB of log data!), it can still be very useful. Wading through a log file may be cumbersome and tedious, but it can often quickly provide clues and information that greatly help accelerate the debugging process.

In particular, the debug log can really help with debugging nginx configurations, especially the most complicated parts, like location matching and rewrite chains.

Of course, debug logs should never be enabled in a production environment. The amount of space they use and the amount of information they store mean a lot of I/O load on your server, which can degrade the whole system’s performance significantly.

Data Storage Logs

Another type of server log (useful for debugging) is data storage logs. In MySQL, you can turn them on by adding these lines:

[mysqld]
general_log = 1
general_log_file = /var/log/mysql/query.log

These logs simply contain a list of queries run by the system while serving database requests in chronological order, which can be helpful for various debugging and tracing needs. However, they should not stay enabled on production systems, since they will generate extra unnecessary I/O load, which affects performance.
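
Note that the general query log can also be toggled at runtime, which makes it convenient to enable only for the duration of a debugging session (a sketch; this requires sufficient MySQL privileges):

-- Enable the general query log temporarily, then switch it back off
SET GLOBAL general_log_file = '/var/log/mysql/query.log';
SET GLOBAL general_log = 'ON';
-- ... reproduce the problem under investigation ...
SET GLOBAL general_log = 'OFF';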

Writing to Your Log Files

PHP itself provides functions for opening, writing to, and closing log files (openlog(), syslog(), and closelog(), respectively).

There are also numerous logging libraries for the PHP developer, such as Monolog (popular among Symfony and Laravel users), as well as various framework-specific implementations, such as the logging capabilities incorporated into CakePHP. Generally, libraries like Monolog not only wrap syslog() calls, but also allow using other backend functionality and tools.
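
For instance, here’s a minimal sketch of what logging through Monolog’s syslog handler might look like (the channel and ident names are arbitrary):

<?php

require __DIR__ . '/vendor/autoload.php';

use Monolog\Logger;
use Monolog\Handler\SyslogHandler;

// Route messages for this channel to the LOG_LOCAL0 facility
$log = new Logger('my');
$log->pushHandler(new SyslogHandler('myapp', LOG_LOCAL0));
$log->info('It works!');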

Here’s a simple example of how to write to the log with PHP’s native functions:

<?php

openlog(uniqid(), LOG_ODELAY, LOG_LOCAL0);
syslog(LOG_INFO, 'It works!');

Our call here to openlog:

  • configures PHP to prepend a unique identifier to each system log message within the script’s lifetime
  • sets it to delay opening the syslog connection until the first syslog() call has occurred
  • sets LOG_LOCAL0 as the default logging facility

Here’s what the contents of the log file would look like after running the above code:

# cat /var/log/my/info.log
Mar  2 00:23:29 log-sandbox 54f39161a2e55: It works!

Maximizing the Value of Your PHP Log Files

Now that we’re all good with theory and basics, let’s see how much we can get from the logs while making as few changes as possible to our sample Symfony Standard project.

First, let’s create the scripts src/log-begin.php (to properly open and configure our logs) and src/log-end.php (to log information about successful completion). Note that, for simplicity, we’ll just write all messages to the info log.

# src/log-begin.php

<?php

define('START_TIME', microtime(true));
openlog(uniqid(), LOG_ODELAY, LOG_LOCAL0);
syslog(LOG_INFO, 'BEGIN');
syslog(LOG_INFO, "URI\t{$_SERVER['REQUEST_URI']}");
$browserHash = substr(md5($_SERVER['HTTP_USER_AGENT']), 0, 7);
syslog(LOG_INFO, "CLIENT\t{$_SERVER['REMOTE_ADDR']}\t{$browserHash}"); <br />

# src/log-end.php

<?php

syslog(LOG_INFO, "DISPATCH TIME\t" . round(microtime(true) - START_TIME, 2));
syslog(LOG_INFO, 'END');

And let’s require these scripts in app.php:

<?php

require_once(dirname(__DIR__) . '/src/log-begin.php');
syslog(LOG_INFO, "MODE\tPROD");

# original app.php contents

require_once(dirname(__DIR__) . '/src/log-end.php');

For the development environment, we want to require these scripts in app_dev.php as well. The code to do so would be the same as above, except we would set the MODE to DEV rather than PROD.

We also want to track what controllers are being invoked, so let’s add one more line in Acme\DemoBundle\EventListener\ControllerListener, right at the beginning of the ControllerListener::onKernelController() method:

syslog(LOG_INFO, "CONTROLLER\t" . get_class($event->getController()[0]));

Note that these changes total a mere 15 extra lines of code, but can collectively yield a wealth of information.

Analyzing the Data in Your Log Files

For starters, let’s see how many HTTP requests are required to serve each page load.

Here’s the info in the logs for one request, based on the way we’ve configured our logging:

Mar  3 12:04:20 log-sandbox 54f58724b1ccc: BEGIN
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: URI    /app_dev.php/
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: CLIENT 192.168.56.1    1b101cd
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: MODE   DEV
Mar  3 12:04:23 log-sandbox 54f58724b1ccc: CONTROLLER Acme\DemoBundle\Controller\WelcomeController
Mar  3 12:04:25 log-sandbox 54f58724b1ccc: DISPATCH TIME  4.51
Mar  3 12:04:25 log-sandbox 54f58724b1ccc: END
Mar  3 12:04:25 log-sandbox 54f5872967dea: BEGIN
Mar  3 12:04:25 log-sandbox 54f5872967dea: URI    /app_dev.php/_wdt/59b8b6
Mar  3 12:04:25 log-sandbox 54f5872967dea: CLIENT 192.168.56.1    1b101cd
Mar  3 12:04:25 log-sandbox 54f5872967dea: MODE   DEV
Mar  3 12:04:28 log-sandbox 54f5872967dea: CONTROLLER Symfony\Bundle\WebProfilerBundle\Controller\ProfilerController
Mar  3 12:04:29 log-sandbox 54f5872967dea: DISPATCH TIME  4.17
Mar  3 12:04:29 log-sandbox 54f5872967dea: END

So now we know that each page load is actually served with two HTTP requests.

Actually, there are two points worth mentioning here. First, two requests per page load applies to using Symfony in dev mode (which I have done throughout this article). You can identify dev mode calls by searching for /app_dev.php/ URL chunks. Second, while each page load is served with two subsequent requests to the Symfony app, as we saw earlier in the nginx access logs there are actually more HTTP calls, some of which are for static content.

OK, now let’s surf a bit on the demo site (to build up the data in the log files) and let’s see what else we can learn from these logs.

How many requests were served in total since the beginning of the logfile?

# grep -c BEGIN info.log
10

Did any of them fail (did the script shut down without reaching the end)?

# grep -c END info.log
10

We see that the number of BEGIN records matches the number of END records, which tells us that all of the calls were successful. (If the PHP script had not completed successfully, it would not have reached execution of the src/log-end.php script.)
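
If the counts ever diverge, a bash sketch like the following (assuming the message format above, where the request ID is the fifth whitespace-separated field) would list the IDs of requests that began but never finished:

# comm -23 <(grep BEGIN info.log | awk '{print $5}' | sort) <(grep END info.log | awk '{print $5}' | sort)

Each ID it prints can then be grepped for directly to reconstruct what that request was doing before it died.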

What was the percentage of requests to the site root?

# grep -cE "\s/app_dev.php/$" info.log
2

This tells us that there were 2 page loads of the site root. Since we previously learned that (a) there are 2 requests to the app per page load and (b) there were a total of 10 HTTP requests, the percentage of requests to the site root was 40% (i.e., 2×2/10).

Which controller class is responsible for serving requests to site root?

# grep -E "\s/$|\s/app_dev.php/$" info.log | head -n1
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: URI  /app_dev.php/

# grep 54f58724b1ccc info.log | grep CONTROLLER
Mar  3 12:04:23 log-sandbox 54f58724b1ccc: CONTROLLER   Acme\DemoBundle\Controller\WelcomeController

Here we used the unique ID of a request to check all log messages related to that single request. We thereby were able to determine that the controller class responsible for serving requests to site root is Acme\DemoBundle\Controller\WelcomeController.

Which clients with IPs of subnet 192.168.0.0/16 have accessed the site?

# grep CLIENT info.log | cut -d":" -f4 | cut -f2 | grep "^192\.168\." | sort | uniq
192.168.56.1

As expected in this simple test case, only my host computer has accessed the site. This is of course a very simplistic example, but the capability that it demonstrates (of being able to analyse the sources of the traffic to your site) is obviously quite powerful and important.
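
A natural extension of this is counting how many requests each client has made, simply by adding the -c flag to uniq:

# grep CLIENT info.log | cut -d":" -f4 | cut -f2 | sort | uniq -c | sort -rn
     10 192.168.56.1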

How much of the traffic to my site has been from Firefox?

Having 1b101cd as the hash of my Firefox User-Agent, I can answer this question as follows:

# grep -c 1b101cd info.log
8
# grep -c CLIENT info.log
10

Answer: 80% (i.e., 8/10)

What is the percentage of requests that yielded a “slow” response?

For purposes of this example, we’ll define “slow” as taking more than 5 seconds to provide a response. Accordingly:

# grep "DISPATCH TIME" info.log | grep -cE "\s[0-9]{2,}\.|\s[5-9]\."
2

Answer: 20% (i.e., 2/10)

Did anyone ever supply GET parameters?

# grep URI info.log | grep \?

No. The Symfony Standard demo uses only URL slugs, so an empty result here also suggests that no one has attempted to tamper with the site via query parameters.

These are just a handful of relatively rudimentary examples of the ways in which log files can be creatively leveraged to yield valuable usage information and even basic analytics.

Other Things to Keep in Mind

Keeping Things Secure

Another heads-up is for security. You might think that logging requests is a good idea; in most cases it indeed is. However, it’s important to be extremely careful about removing any potentially sensitive user information before storing it in the log.

Fighting Log File Bloat

Log files are text files to which you always append information, so they are constantly growing. Since this is a well-known issue, there are some fairly standard approaches to controlling log file growth.

The easiest is to rotate the logs. Rotating logs means:

  • Periodically replacing the log with a new empty file for further writing
  • Storing the old file for history
  • Removing files that have “aged” sufficiently to free up disk space
  • Making sure the application can write to the logs uninterrupted when these file changes occur

The most common solution for this is logrotate, which ships pre-installed with most *nix distributions. Let’s see a simple configuration file for rotating our logs:

/var/log/my/debug.log
/var/log/my/info.log
/var/log/my/warning.log
/var/log/my/error.log
{
    rotate 7
    daily
    missingok
    notifempty
    delaycompress
    compress
    sharedscripts
    postrotate
        invoke-rc.d rsyslog rotate > /dev/null
    endscript
}

Another, more advanced approach is to make rsyslogd itself write messages into files, dynamically created based on current date and time. This would still require a custom solution for removal of older files, but lets devops manage timeframes for each log file precisely. For our example:

$template DynaLocal0Err,        "/var/log/my/error-%$NOW%-%$HOUR%.log"
$template DynaLocal0Info,       "/var/log/my/info-%$NOW%-%$HOUR%.log"
$template DynaLocal0Warning,    "/var/log/my/warning-%$NOW%-%$HOUR%.log"
$template DynaLocal0Debug,      "/var/log/my/debug-%$NOW%-%$HOUR%.log"
local0.err      ?DynaLocal0Err
local0.info     -?DynaLocal0Info
local0.warning  ?DynaLocal0Warning
local0.debug    -?DynaLocal0Debug

This way, rsyslog will create an individual log file each hour, and there won’t be any need for rotating them and restarting the daemon. To complete this solution, here’s how log files older than 5 days can be removed:

find /var/log/my/ -mtime +5 -print0 | xargs -0 rm
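
To automate this, the cleanup can be dropped into a small daily cron script (a sketch; the path and the 5-day retention period are just examples):

#!/bin/sh
# e.g. /etc/cron.daily/purge-my-logs (hypothetical location)
# Delete log files older than 5 days; -r keeps xargs from running rm on empty input
find /var/log/my/ -type f -mtime +5 -print0 | xargs -0 -r rm --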

Remote Logs

As the project grows, parsing information from logs gets more and more resource hungry. This not only means creating extra server load; it also means creating peak load on the CPU and disk drives at the times when you parse logs, which can degrade server response time for users (or in a worst case can even bring the site down).

To solve this, consider setting up a centralized logging server. All you need for this is another box with UDP port 514 (default) open. To make rsyslogd listen to connections, add the following line to its config file:

$ModLoad imudp
$UDPServerRun 514

Having this, setting up the client is then as easy as:

*.*   @HOSTNAME:514

(where HOSTNAME is the host name of your remote logging server).

Conclusion

While this article has demonstrated some of the creative ways in which log files can offer way more valuable information than you may have previously imagined, it’s important to emphasize that we’ve only scratched the surface of what’s possible. The extent, scope, and format of what you can log is almost limitless. This means that – if there’s usage or analytics data you want to extract from your logs – you simply need to log it in a way that will make it easy to subsequently parse and analyze. Moreover, that analysis can often be performed with standard Linux command line tools like grep, sed, or awk.

Indeed, PHP log files are a most powerful tool that can be of tremendous benefit.

Resources

Code on GitHub: https://github.com/isanosyan/toptal-blog-logs-post-example


Appendix: Reading and Manipulating Log Files in the Unix Shell

Here is a brief intro to some of the more common *nix command line tools that you’ll want to be familiar with for reading and manipulating your log files.

  • cat is perhaps the simplest one. It prints the whole file to the output stream. For example, the following command will print logfile1 to the console:
    cat logfile1
    
  • > redirects output, for example into another file. It opens the target stream in write mode (which means wiping the target’s contents). Here’s how we replace the contents of tmpfile with the contents of logfile1:
    cat logfile1 > tmpfile
    
  • >> redirects output and opens the target stream in append mode. The current contents of the target file are preserved, and new lines are added to the bottom. This will append logfile1’s contents to tmpfile:
    cat logfile1 >> tmpfile
    
  • grep filters a file by some pattern and prints only matching lines. The command below will only print lines of logfile1 containing the Bingo message:
    grep Bingo logfile1
    
  • cut prints the contents of a single column (by number, starting from 1). By default it uses tab characters as delimiters between columns. For example, if you have a file full of timestamps in the format YYYY-MM-DD HH:MM:SS, this will allow you to print only the years:
    cut -d"-" -f1 logfile1
    
  • head displays only the first lines of a file
  • tail displays only the last lines of a file
  • sort sorts lines in the output
  • uniq filters out duplicate lines
  • wc counts words (or lines when used with the -l flag)
  • | (i.e., the “pipe” symbol) supplies output from one command as input to the next. Pipe is very convenient for combining commands. For example, here’s how we can find months of 2014 that occur within a set of timestamps:
    grep -E "^2014" logfile1 | cut -d"-" -f2 | sort | uniq
    

Here we first match lines against the regular expression “starts with 2014”, then cut out the months. Finally, we use a combination of sort and uniq to print each occurrence only once.
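
One more variant worth knowing when working with live logs is tail’s -f flag, which keeps the file open and prints new lines as they are appended, e.g.:

tail -f /var/log/my/err.log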

This article originally appeared on Toptal

9 Essential System Security Interview Questions

1. What is a pentest?

“Pentest” is short for “penetration test”, and involves having a trusted security expert attack a system for the purpose of discovering, and repairing, security vulnerabilities before malicious attackers can exploit them. This is a critical procedure for securing a system, as the alternative method for discovering vulnerabilities is to wait for unknown agents to exploit them. By this time it is, of course, too late to do anything about them.

In order to keep a system secure, it is advisable to conduct a pentest on a regular basis, especially when new technology is added to the stack, or vulnerabilities are exposed in your current stack.


2. What is social engineering?

“Social engineering” refers to the use of humans as an attack vector to compromise a system. It involves fooling or otherwise manipulating human personnel into revealing information or performing actions on the attacker’s behalf. Social engineering is known to be a very effective attack strategy, since even the strongest security system can be compromised by a single poor decision. In some cases, highly secure systems that cannot be penetrated by computer or cryptographic means can be compromised by simply calling a member of the target organization on the phone and impersonating a colleague or IT professional.

Common social engineering techniques include phishing, clickjacking, and baiting, although several other tricks are at an attacker’s disposal. Baiting with foreign USB drives was famously used to introduce the Stuxnet worm into Iran’s uranium enrichment facilities, damaging the nation’s ability to produce nuclear material.

For more information, a good read is Christopher Hadnagy’s book Social Engineering: The Art of Human Hacking.

3. You find PHP queries overtly in the URL, such as /index.php?page=userID. What would you then be looking to test?

This is an ideal situation for injection attacks and querying. If we know that the server is using an SQL database with a PHP controller, it becomes quite easy. We would be looking to test how the server reacts to multiple different types of requests, and what it throws back, looking for anomalies and errors.

One example could be code injection. If the server is not authenticating and validating each user request, one could simply try /index.php?arg=1;system('id') and see if the host returns unintended data.

4. You find yourself in an airport in the depths of a foreign superpower. You’re out of mobile broadband and don’t trust the Wi-Fi. What do you do? Further, what are the potential threats from open Wi-Fi networks?

Ideally you want all of your data to pass through an encrypted connection. This would usually entail tunneling via SSH into whatever outside service you need, over a virtual private network (VPN). Otherwise, you’re vulnerable to all manner of attacks, from man-in-the-middle to captive portal exploitation, and so on.
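
As a concrete illustration, one common approach (assuming you have SSH access to some trusted host; the username, hostname, and port here are hypothetical) is to open a local SOCKS proxy over SSH and point your applications at it:

# All traffic sent to the local SOCKS5 proxy on port 1080 is tunneled
# through trusted-host; -C compresses, -N skips running a remote command
ssh -D 1080 -C -N user@trusted-host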

5. What does it mean for a machine to have an “air gap”? Why are air gapped machines important?

An air gapped machine is simply one that cannot connect to any outside agents, from the highest level being the internet down to the lowest, such as an intranet or even Bluetooth.

Air gapped machines are isolated from other computers, and are important for storing sensitive data or carrying out critical tasks that should be immune from outside interference. For example, a nuclear power plant should be operated from computers that are behind a full air gap. For the most part, real world air gapped computers are usually connected to some form of intranet in order to make data transfer and process execution easier. However, every connection increases the risk that outside actors will be able to penetrate the system.


6. You’re tasked with setting up an email encryption system for certain employees of a company. What’s the first thing you should be doing to set them up? How would you distribute the keys?

The first task is to do a full clean and make sure that the employees’ machines aren’t compromised in any way. This would usually involve something along the lines of a selective backup. One would take only the very necessary files from one computer and copy them to a clean replica of the new host. We give the replica an internet connection and watch for any suspicious outgoing or incoming activity. Then one would perform a full secure erase on the employee’s original machine, to delete everything right down to the last data tick, before finally restoring the backed up files.

The keys should then be given out by transferring them over wire through a machine or device with no other connections, importing any necessary .p7s email certificate files into a trusted email client, then securely deleting any trace of the certificate on the originating computer.

The first step, cleaning the computers, may seem long and laborious. Theoretically, if you are 100% certain that the machine is in no way affected by any malicious scripts, then of course there is no need for such a process. However in most cases, you’ll never know this for sure, and if any machine has been backdoored in any kind of way, this will usually mean that setting up secure email will be done in vain.

7. You manage to capture email packets from a sender that are encrypted through Pretty Good Privacy (PGP). What are the most viable options to circumvent this?

First, one should be considering whether to even attempt circumventing the encryption directly. Decryption is nearly impossible here unless you already happen to have the private key. Without this, your computer will be spending multiple lifetimes trying to decrypt a 2048-bit key. It’s likely far easier to simply compromise an end node (i.e. the sender or receiver). This could involve phishing, exploiting the sending host to try and uncover the private key, or compromising the receiver to be able to view the emails as plain text.

8. What makes a script fully undetectable (FUD) to antivirus software? How would you go about writing a FUD script? 

A script is FUD to an antivirus when it can infect a target machine and operate without being noticed on that machine by that AV. This usually entails a script that is simple, small, and precise.

To know how to write a FUD script, one must understand what the targeted antivirus is actually looking for. If the script contains events such as Hook_Keyboard(), File_Delete(), or File_Copy(), it’s very likely it will be picked up by antivirus scanners, so these events are not used. Further, FUD scripts will often mask function names with common names used in the industry, rather than naming them things like fToPwn1337(). A talented attacker might even break up his or her files into smaller chunks, and then hex edit each individual file, thereby making it even more unlikely to be detected.

As antivirus software becomes more and more sophisticated, attackers become more sophisticated in response. Antivirus software such as McAfee is much harder to fool now than it was 10 years ago. However, there are talented hackers everywhere who are more than capable of writing fully undetectable scripts, and who will continue to do so. Virus protection is very much a cat and mouse game.

9. What is a “Man-in-the-Middle” attack?

A man-in-the-middle attack is one in which the attacker secretly relays and possibly alters the communication between two parties who believe they are directly communicating with each other. One example is active eavesdropping, in which the attacker makes independent connections with the victims and relays messages between them to make them believe they are talking directly to each other over a private connection, when in fact the entire conversation is controlled by the attacker, who even has the ability to modify the content of each message. Often abbreviated to MITM, MitM, or MITMA, and sometimes referred to as a session hijacking attack, it has a strong chance of success if the attacker can impersonate each party to the satisfaction of the other. MITM attacks pose a serious threat to online security because they give the attacker the ability to capture and manipulate sensitive information in real-time while posing as a trusted party during transactions, conversations, and the transfer of data. This is straightforward in many circumstances; for example, an attacker within reception range of an unencrypted Wi-Fi access point can insert himself as a man-in-the-middle.

This article is from Toptal.