Did any of you write a web crawler?
Ariesnl

I thought it would be fun to write a web-crawling program that indexes pages and maybe follows links based on some rules that tell it whether a page is "interesting".

No idea where to start, though.

Chris Katko

You could probably just use cURL or wget with a bash or python script to handle all the downloading specifics.

It can't be that hard. Just make sure you limit the number of steps you follow.

NO IDEA how they handle modern Web 2.0 interfaces that change asynchronously without reloading, like Facebook, though.

If I had "no idea where to start" I'd probably just google for similar Stack Overflow questions and read all of them.
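
For plain static pages, the basic loop would be something like this in Python, shelling out to curl for the downloads (a rough, untested sketch; the start URL, the depth limit, and the naive href regex are all just placeholders):

# Rough sketch: let curl handle the downloading specifics, pull href links out
# with a naive regex, and follow them breadth-first up to a fixed depth limit.
import re
import subprocess

def fetch(url):
    # curl -sL: silent, follow redirects
    return subprocess.run(["curl", "-sL", url], capture_output=True, text=True).stdout

def links(html):
    return re.findall(r'href="(https?://[^"]+)"', html)   # naive; a real parser is better

def crawl(start_url, max_depth=2):
    seen = set()
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        html = fetch(url)
        print(depth, url)
        for link in links(html):
            frontier.append((link, depth + 1))

crawl("http://www.example.com/", max_depth=2)              # placeholder start page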

Eric Johnson

I wrote a small PHP script a few years ago that would download images from Web sites, if that counts. It basically worked by scanning a Web page for image tags, then it would download said images. It couldn't handle anything generated via AJAX though, obviously. I later made it follow links on pages, but discontinued that when it stumbled upon a porn site... :-X
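
The same idea looks roughly like this in Python (just a sketch, not the original PHP; the img-tag regex is naive, and as said, AJAX-generated content won't show up):

# Scan one page for <img> tags and download the images they reference.
import os
import re
import urllib.request
from urllib.parse import urljoin

def download_images(page_url, dest_dir="images"):
    os.makedirs(dest_dir, exist_ok=True)
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="replace")
    for src in re.findall(r'<img[^>]+src="([^"]+)"', html, re.IGNORECASE):
        img_url = urljoin(page_url, src)                       # resolve relative paths
        name = os.path.basename(img_url.split("?")[0]) or "image"
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(urllib.request.urlopen(img_url).read())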

I think you can accomplish the same things with wget though.

MiquelFire

I did make something in PHP one time. It would download an index page so it could find the correct sub pages (that list could actually change over time), then download each sub page and generate a bunch of TXT files with some data from those sub pages' tables. I kept a delay of about 2 seconds between hits so as not to hit the server too hard.
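
The pattern was basically this (sketched in Python rather than the original PHP; the URLs, the link pattern, and the table scraping are placeholders):

# Fetch an index page, find the sub pages it links to, and dump some data from
# each one into a TXT file, pausing between requests to go easy on the server.
import re
import time
import urllib.request

INDEX_URL = "http://www.example.com/index.html"    # placeholder
DELAY_SECONDS = 2                                  # politeness delay between hits

def fetch(url):
    return urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

sub_pages = re.findall(r'href="([^"]+\.html)"', fetch(INDEX_URL))   # placeholder pattern

for i, page in enumerate(sub_pages):
    time.sleep(DELAY_SECONDS)                      # don't hammer the server
    cells = re.findall(r"<td>(.*?)</td>", fetch("http://www.example.com/" + page), re.DOTALL)
    with open("page_%d.txt" % i, "w", encoding="utf-8") as f:
        f.write("\n".join(cells))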

I meant to do more with it, but never found the time. And once I had no reason to use the files myself, I had no reason to update it for the site's redesign, since I noticed no one else was using it but me.

bamccaig

I'd say that deciding what's "interesting" is extremely advanced. Keep in mind that the search engines we have today have been in the works for decades. They didn't get that advanced overnight, or with a one-man crew.

Some simple things your crawler would need:

  • Ability to make an HTTP request to a server and receive the response.

  • Ability to parse the [X]HTML response to find the links within it.

  • A database to keep track of where you've been and when so you don't go in circles.

  • Within the database, index what you can about where you've been. This is where it starts to get complicated. I suggest keeping it simple for the purposes of this project. Once you have it working small you can expand on it if you're still interested, but nothing fancy is necessary for a proof of concept (there's a rough sketch of these pieces right after this list).
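
A minimal sketch of those pieces in Python, using only the standard library (untested; the start URL is a placeholder, and a plain set stands in for the database):

# Fetch a page, parse out its links, and remember where you've been so you
# don't go in circles. A set stands in for a real database here.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    visited = set()
    frontier = [start_url]
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue                                # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)           # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return visited

crawl("http://www.example.com/")                    # placeholder start page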

Those are some pretty basic goals to get started with. Don't try to roll your own, because each of these can easily take weeks to do fully. If you can find libraries to do it for you, you'll save a lot of time. It's not that you couldn't write it all yourself. It's that there's no real value in doing it again, since many others have already spent the time and money to do a better job than you have the time or money to do, and they've been kind/generous enough to share the fruits of their labors so you don't have to reinvent the square wheel.
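
For example, with an off-the-shelf HTML parser, link extraction is only a few lines (this assumes the third-party requests and beautifulsoup4 packages are installed):

# Link extraction with an existing parser instead of hand-rolled code.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.example.com/").text          # placeholder URL
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)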

There are various standard files that you should research too so that you obey them, such as /robots.txt, which describes to crawlers which resources they're welcome to index and which they should leave alone (or whether crawling is frowned upon in general).
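
Python even ships a parser for it in the standard library (sketch; the site URL and user-agent string are placeholders):

# Check robots.txt before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")    # placeholder site
rp.read()

if rp.can_fetch("MyCrawler/0.1", "http://www.example.com/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")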

You'll also probably want to narrow the scope of how far your crawler goes while you work out the bugs and figure out the netiquette rules. You won't want somebody the size of Google summoning you to court (granted, I don't think Google would be hurt by your bot, but some smaller fish might be, and they still might be big enough to eat you). ;)
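
An easy way to narrow the scope is to stick to a short list of hosts and pause between requests (sketch; the allowed hosts and the delay are placeholders):

# Keep the crawler on an allowed list of hosts and rate-limit its requests.
import time
import urllib.request
from urllib.parse import urlparse

ALLOWED_HOSTS = {"www.example.com"}     # placeholder: only hosts you have a reason to crawl
DELAY_SECONDS = 2.0                     # pause between requests

def in_scope(url):
    return urlparse(url).netloc in ALLOWED_HOSTS

def polite_fetch(url):
    time.sleep(DELAY_SECONDS)           # simple global rate limit
    return urllib.request.urlopen(url).read()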

A language such as Perl or Python would make this much easier. Not only do they have excellent libraries for these kinds of things, but they also have easy access to Unicode strings and databases and the like. Whereas if you attempt to do this in C or C++ you'll probably have to write 10x or 100x more code for the same job. And you won't need the things C or C++ are good at right away, if at all, so you might as well optimize for progress instead of performance.
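
For instance, the "where you've been and when" database from the list above is only a few lines with Python's built-in sqlite3 module (sketch; the file name is arbitrary):

# Track visited URLs and timestamps in SQLite so the crawler doesn't loop.
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("crawler.db")
db.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY, visited_at TEXT)")

def already_visited(url):
    return db.execute("SELECT 1 FROM visited WHERE url = ?", (url,)).fetchone() is not None

def mark_visited(url):
    db.execute("INSERT OR REPLACE INTO visited (url, visited_at) VALUES (?, ?)",
               (url, datetime.now(timezone.utc).isoformat()))
    db.commit()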

Ariesnl

How about giving links that are less like the current page a higher priority?
That would reduce the load on any single server.
I already have working skeleton code in C#:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

namespace Crawler1
{
    class Program
    {
        static List<string> history = new List<string>();
        static List<string> urllist = new List<string>();
        static string strCurrent;
        static StreamWriter sw = new StreamWriter("Links.txt");

        static void Main(string[] args)
        {
            string page = GetPage("www.startpagina.nl/"); // <-- starting page

            bool blQuit = false;

            while (!blQuit)
            {
                if (page.Length > 0)
                {
                    // grab urls from page
                    foreach (string st in GetURLs(page))
                    {
                        urllist.Add(st);
                    }
                }

                int l = 0; // look for links that were not visited before
                foreach (string s in urllist)
                {
                    if (!history.Contains(s))
                    {
                        l++;
                        Console.WriteLine("! ==>");
                        history.Add(strCurrent);
                        urllist.Remove(strCurrent); // ToDo: make some intelligent sort, high prio links should come first
                        page = GetPage(s);
                        Console.WriteLine(s);
                        if (IsInteresting(page))
                        {
                            sw.WriteLine(s);
                            sw.Flush();
                            Console.WriteLine("[LOGGED !]");
                        }
                    }
                }

                // no more links to follow, track back
                if (l == 0)
                {
                    if (history.Count > 0)
                    {
                        page = history.Last();
                        history.RemoveAt(history.Count - 1);
                        Console.WriteLine("<== !");
                    }
                    else
                    {
                        blQuit = true;
                        Console.WriteLine("Ready...");
                    }
                }
            }

            Console.ReadKey();
        }

        static bool IsInteresting(string page)
        {
            bool blResult = false;
            String[] words = page.Split(' ');
            double count = 0;
            foreach (string word in words)
            {
                if (word.ToLower() == "auto")
                {
                    count++;
                }
            }
            if (count > 5)
            {
                blResult = true;
            }

            return blResult;
        }

        static string GetPage(string url)
        {
            string html = "";
            try
            {
                WebRequest request = WebRequest.Create(@"http:\\" + url);
                WebResponse response = request.GetResponse();
                Stream data = response.GetResponseStream();
                html = String.Empty;
                using (StreamReader sr = new StreamReader(data))
                {
                    html = sr.ReadToEnd();
                }
                strCurrent = url;
            }
            catch (Exception e)
            {
            }
            return html;
        }

        static List<string> GetURLs(string a_str)
        {
            List<string> res = new List<string>();

            string[] strs = Regex.Split(a_str, @"[\n:|<|>]");

            foreach (string ss in strs)
            {
                MatchCollection mc = Regex.Matches(ss, @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)");
                foreach (Match m in mc)
                {
                    res.Add(m.Value);
                }
            }
            res.Sort();
            return res;
        }

        static void Print(List<string> list)
        {
            foreach (string url in list)
            {
                Console.WriteLine(url);
            }
        }
    }
}

It works, but I can already see some bugs in it.
I was tired yesterday evening... but still, it works (sort of).
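
The "intelligent sort" ToDo could start out really simple, e.g. move links that point to a different host to the front of the queue (a sketch of the idea in Python, not wired into the C# above):

# Prefer links whose host differs from the page just crawled, so consecutive
# requests spread across servers instead of hammering one of them.
from urllib.parse import urlparse

def prioritize(links, current_url):
    current_host = urlparse(current_url).netloc
    other_hosts = [u for u in links if urlparse(u).netloc != current_host]
    same_host = [u for u in links if urlparse(u).netloc == current_host]
    return other_hosts + same_host      # cross-host links get crawled first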

GullRaDriel

Just curl it baby. 8-)
