
Writing a MapReduce program in Java to analyze web log data
In this recipe, we are going to take a look at how to write a MapReduce program to analyze web logs. Web logs are data generated by web servers for the requests they receive. There are various web servers, such as Apache, Nginx, Tomcat, and so on, and each logs data in its own specific format. In this recipe, we are going to use data from the Apache web server, which is in the combined access log format.
Note
To read more on combined access logs, refer to
http://httpd.apache.org/docs/1.3/logs.html#combined.
Getting ready
To perform this recipe, you should already have a running Hadoop cluster as well as an IDE such as Eclipse.
How to do it...
We can write MapReduce programs to analyze various aspects of web log data. In this recipe, we are going to write a MapReduce program that reads a web log file and computes the number of page views for each URL. Here is some sample web log data we'll consider as input for our program:
106.208.17.105 - - [12/Nov/2015:21:20:32 -0800] "GET /tutorials/mapreduce/advanced-map-reduce-examples-1.html HTTP/1.1" 200 0 "https://www.google.co.in/" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
60.250.32.153 - - [12/Nov/2015:21:42:14 -0800] "GET /tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html HTTP/1.1" 304 0 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
49.49.250.23 - - [12/Nov/2015:21:40:56 -0800] "GET /tutorials/hadoop/images/internals-of-hdfs-file-read-operations/HDFS_Read_Write.png HTTP/1.1" 200 0 "http://hadooptutorials.co.in/tutorials/spark/install-apache-spark-on-ubuntu.html" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; LCTE; rv:11.0) like Gecko"
60.250.32.153 - - [12/Nov/2015:21:36:01 -0800] "GET /tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html HTTP/1.1" 200 0 - "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
91.200.12.136 - - [12/Nov/2015:21:30:14 -0800] "GET /tutorials/hadoop/hadoop-fundamentals.html HTTP/1.1" 200 0 "http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36"
These combined Apache Access logs are in a specific format. Here is the sequence and meaning of each component in each access log:
- %h: This is the remote host (that is, the client's IP address)
- %l: This is the identity of the client as reported by identd (this is usually not used since it's not reliable)
- %u: This is the username determined by HTTP authentication
- %t: This is the time at which the server received the request
- %r: This is the request line from the client (for example, "GET / HTTP/1.0")
- %>s: This is the status code sent from the server to the client (200, 404, and so on)
- %b: This is the size of the response sent to the client (in bytes)
- Referrer: This is the page the client was referred from (that is, the page that linked to this URL)
- User agent: This is the browser identification string
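To make the format concrete, here is a minimal standalone sketch (the LogLineParseDemo class name and the printed field labels are only for illustration, not part of the recipe's final job) that applies the same regular expression used by the mapper below to the first sample line and prints a few of the captured groups:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParseDemo {
    public static void main(String[] args) {
        // First sample log line from above, split for readability
        String logLine = "106.208.17.105 - - [12/Nov/2015:21:20:32 -0800] "
                + "\"GET /tutorials/mapreduce/advanced-map-reduce-examples-1.html HTTP/1.1\" 200 0 "
                + "\"https://www.google.co.in/\" \"Mozilla/5.0 (Windows NT 6.3; WOW64) "
                + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36\"";
        // Same regex the mapper uses; each capturing group corresponds to one log component
        String logPattern = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+) (.+?) \"([^\"]+|(.+?))\"";
        Matcher matcher = Pattern.compile(logPattern).matcher(logLine);
        if (matcher.matches()) {
            System.out.println("Remote host (%h)  : " + matcher.group(1)); // 106.208.17.105
            System.out.println("Request time (%t) : " + matcher.group(4)); // 12/Nov/2015:21:20:32 -0800
            System.out.println("Page URL          : " + matcher.group(6)); // the requested page
            System.out.println("Status code (%>s) : " + matcher.group(8)); // 200
            System.out.println("Response size (%b): " + matcher.group(9)); // 0
        }
    }
}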
Now, let's start writing the program to get the page view count of each unique URL that we have in our web logs.
First, we will write a mapper class that reads each line and parses it to extract the page URL. Here, we will use Java's Pattern and Matcher utilities to extract the required information:
public static class PageViewMapper extends Mapper<Object, Text, Text, IntWritable> {

    public static String APACHE_ACCESS_LOGS_PATTERN = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+) (\\S+)\" (\\d{3}) (\\d+) (.+?) \"([^\"]+|(.+?))\"";
    public static Pattern pattern = Pattern.compile(APACHE_ACCESS_LOGS_PATTERN);

    private static final IntWritable one = new IntWritable(1);
    private Text url = new Text();

    public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Matcher matcher = pattern.matcher(value.toString());
        if (matcher.matches()) {
            // Group 6 as we want only Page URL
            url.set(matcher.group(6));
            System.out.println(url.toString());
            context.write(this.url, one);
        }
    }
}
In the preceding mapper class, we read key-value pairs from the text file. By default, the key is the byte offset at which the line starts in the file, and the value is the line itself. Next, we match the line against the Apache access log regex pattern so that we can extract the exact information we need. For a page view counter, we only need the URL. The mapper outputs the URL as the key and 1 as the value, so we can count these URLs in the reducer.
Here is the reducer class that sums up the output values of the mapper class:
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        this.result.set(sum);
        context.write(key, this.result);
    }
}
Now, we just need a driver class to call these mappers and reducers:
public class PageViewCounter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: PageViewCounter <in><out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Page View Counter");
        job.setJarByClass(PageViewCounter.class);
        job.setMapperClass(PageViewMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
As the operation we are performing is an aggregation, we can also use a combiner here to cut down the amount of data shuffled from the mappers to the reducers. Because summing counts is associative and commutative, the same reducer class can safely be reused as the combiner, which is what setCombinerClass() does in the driver above.
To compile your program properly, you need to add two external JARs: hadoop-common-2.7.jar, which can be found in the /usr/local/hadoop/share/hadoop/common folder, and hadoop-mapreduce-client-core-2.7.jar, which can be found in the /usr/local/hadoop/share/hadoop/mapreduce folder.
Make sure you add these two JARs to your build path so that your program compiles without errors.
How it works...
The page view counter program helps us find the most popular pages, the least accessed pages, and so on. Such information helps us make decisions about the ranking of pages, the frequency of visits, and the relevance of a page. When the program is executed, each line of the HDFS block is read individually and sent to the mapper. The mapper matches the input line against the log format and extracts the page URL. It then emits (URL, 1) key-value pairs. These pairs are shuffled and partitioned so that all records for the same URL go to the same reducer. Each reducer adds up all the values for a key and emits the total. This way, we get results in the form of a URL and the number of times it was accessed.
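For example, with the five sample log lines shown at the beginning of this recipe, the final output would look roughly like the following (each line is a URL, a tab, and its view count; the Elasticsearch tutorial page appears twice in the sample, so its count is 2):
/tutorials/elasticsearch/install-elasticsearch-kibana-logstash-on-windows.html	2
/tutorials/hadoop/hadoop-fundamentals.html	1
/tutorials/hadoop/images/internals-of-hdfs-file-read-operations/HDFS_Read_Write.png	1
/tutorials/mapreduce/advanced-map-reduce-examples-1.html	1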