Java Web Crawler

A web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Web crawlers can be written in Java using multiple producer and consumer threads, with shared queues holding the HTML of each downloaded webpage.
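Each queue can sit on top of java.util.concurrent, which handles the locking between producer and consumer threads. Here is a minimal sketch of a LinkQueue with the two methods the crawler calls (the exact API is an assumption; the queue classes are not shown in this post):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the crawler's LinkQueue (assumed API).
// LinkedBlockingQueue does the locking, so producer and consumer
// threads can share it safely without explicit synchronization.
public class LinkQueue {
    private static final BlockingQueue<String> links = new LinkedBlockingQueue<>();

    // consumer threads add discovered URLs here
    public static void addLink(String url) {
        links.offer(url);
    }

    // blocks until a URL is available
    public static String getNextLink() throws InterruptedException {
        return links.take();
    }
}
```

A PageQueue holding parsed pages would look the same, with the element type swapped for the page representation.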

This crawler was made as a project for my systems programming class.

Consumer thread:

[code language="java"]
import java.util.HashMap;
import java.util.regex.Pattern;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ConsumerThread extends Thread {

    private static HashMap<String, Integer> userKeywords = new HashMap<>();
    private static HashMap<String, Integer> keywordPages = new HashMap<>();
    private static int totalKeywords;
    private volatile boolean done = false;

    /**
     * Defines the run method of a ConsumerThread, which gets the HTML of a
     * webpage from the pageQueue and finds the links throughout the page as
     * well as the user keywords.
     */
    public void run() {
        while (!done) {
            // get the next parsed page off the queue
            Document pageText = PageQueue.getNextPage();

            // collect every link on the page and queue it for crawling
            Elements links = pageText.select("a[href]");
            for (Element link : links) {
                String url = link.absUrl("href");
                LinkQueue.addLink(url);
            }

            // count instances of the user-entered keywords
            // (Pattern.quote escapes regex metacharacters in the keyword;
            // the -1 limit keeps a trailing match from being dropped)
            for (String key : userKeywords.keySet()) {
                String[] brokenUpPage = pageText.toString().split(Pattern.quote(key), -1);
                userKeywords.put(key, userKeywords.get(key) + brokenUpPage.length - 1);
            }

            // find pages containing each keyword
            for (String key : keywordPages.keySet()) {
                String[] brokenUpPage = pageText.toString().split(Pattern.quote(key), -1);
                if (brokenUpPage.length > 1) {
                    keywordPages.put(key, keywordPages.get(key) + 1);
                    totalKeywords += brokenUpPage.length - 1;
                }
            }
        }
    }
}
[/code]
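The matching producer side is not shown here. Below is a sketch of one possible shape, with the page fetcher injected as a function so the hand-off between the two queues is visible without a network call; in the real crawler the fetch would be a jsoup call such as Jsoup.connect(url).get(), and the class name and queue wiring are assumptions:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Sketch of a producer thread (assumed counterpart to ConsumerThread above):
// takes a URL off the link queue, fetches the page, and hands the result to
// the consumers via the page queue.
public class ProducerThread extends Thread {
    private final BlockingQueue<String> linkQueue;
    private final BlockingQueue<String> pageQueue;
    private final Function<String, String> fetcher; // url -> page HTML
    private volatile boolean done = false;

    public ProducerThread(BlockingQueue<String> linkQueue,
                          BlockingQueue<String> pageQueue,
                          Function<String, String> fetcher) {
        this.linkQueue = linkQueue;
        this.pageQueue = pageQueue;
        this.fetcher = fetcher;
    }

    // interrupt() wakes the thread out of a blocking take()
    public void shutdown() {
        done = true;
        this.interrupt();
    }

    @Override
    public void run() {
        while (!done) {
            try {
                String url = linkQueue.take();      // blocks until a link arrives
                pageQueue.put(fetcher.apply(url));  // hand the page to consumers
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // set by shutdown(); loop exits
            }
        }
    }
}
```

Injecting the fetcher also makes the thread testable with a fake page source, which is useful before pointing the crawler at the live web.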
