Did any of you write a web crawler?
Ariesnl

I thought it would be fun to write a web-crawling program that indexes pages and maybe follows links based on some rules that tell it whether a page is "interesting".

No idea where to start, though.

Chris Katko

You could probably just use cURL or wget with a bash or python script to handle all the downloading specifics.

It can't be that hard. Just make sure you limit the number of steps you follow.

NO IDEA how they handle modern Web 2.0 interfaces that change asynchronously without reloading, like Facebook, though.

If I had "no idea where to start" I'd probably just google for similar Stack Overflow questions and read all of them.
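
For plain static pages, the basic loop would be something like this in Python, shelling out to curl for the downloads (a rough, untested sketch; the start URL, the depth limit, and the naive href regex are all just placeholders):

# Rough sketch: let curl handle the downloading specifics, pull href links out
# with a naive regex, and follow them breadth-first up to a fixed depth limit.
import re
import subprocess

def fetch(url):
    # curl -sL: silent, follow redirects
    return subprocess.run(["curl", "-sL", url], capture_output=True, text=True).stdout

def links(html):
    return re.findall(r'href="(https?://[^"]+)"', html)   # naive; a real parser is better

def crawl(start_url, max_depth=2):
    seen = set()
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        html = fetch(url)
        print(depth, url)
        for link in links(html):
            frontier.append((link, depth + 1))

crawl("http://www.example.com/", max_depth=2)              # placeholder start page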

Eric Johnson

I wrote a small PHP script a few years ago that would download images from Web sites, if that counts. It basically worked by scanning a Web page for image tags, then it would download said images. It couldn't handle anything generated via AJAX though, obviously. I later made it follow links on pages, but discontinued that when it stumbled upon a porn site... :-X
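
The same idea looks roughly like this in Python (just a sketch, not the original PHP; the img-tag regex is naive, and as said, AJAX-generated content won't show up):

# Scan one page for <img> tags and download the images they reference.
import os
import re
import urllib.request
from urllib.parse import urljoin

def download_images(page_url, dest_dir="images"):
    os.makedirs(dest_dir, exist_ok=True)
    html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="replace")
    for src in re.findall(r'<img[^>]+src="([^"]+)"', html, re.IGNORECASE):
        img_url = urljoin(page_url, src)                       # resolve relative paths
        name = os.path.basename(img_url.split("?")[0]) or "image"
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(urllib.request.urlopen(img_url).read())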

I think you can accomplish the same things with wget though.

MiquelFire

I did make something in PHP one time. It would download an index page so it could find the correct sub pages (that list could actually change over time), then download each sub page and generate a bunch of TXT files with some data from those sub pages' tables. I kept a delay of about 2 seconds between hits so as not to hit the server too hard.
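
The pattern was basically this (sketched in Python rather than the original PHP; the URLs, the link pattern, and the table scraping are placeholders):

# Fetch an index page, find the sub pages it links to, and dump some data from
# each one into a TXT file, pausing between requests to go easy on the server.
import re
import time
import urllib.request

INDEX_URL = "http://www.example.com/index.html"    # placeholder
DELAY_SECONDS = 2                                  # politeness delay between hits

def fetch(url):
    return urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

sub_pages = re.findall(r'href="([^"]+\.html)"', fetch(INDEX_URL))   # placeholder pattern

for i, page in enumerate(sub_pages):
    time.sleep(DELAY_SECONDS)                      # don't hammer the server
    cells = re.findall(r"<td>(.*?)</td>", fetch("http://www.example.com/" + page), re.DOTALL)
    with open("page_%d.txt" % i, "w", encoding="utf-8") as f:
        f.write("\n".join(cells))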

I meant to do more with it, but never found the time. And once I had no reason to use the files myself, I had no reason to update it for the site's redesign, since I noticed no one else was using it but me.

bamccaig

I'd say that deciding what's "interesting" is extremely advanced. Keep in mind that the search engines we have today have been in the works for decades. They didn't get that advanced overnight, or with a one-man crew.

Some simple things your crawler would need:

  • Ability to make an HTTP request to a server and receive the response.

  • Ability to parse the [X]HTML response to find the links within it.

  • A database to keep track of where you've been and when so you don't go in circles.

  • Within the database, index what you can about where you've been. This is where it starts to get complicated. I suggest keeping it simple for the purposes of this project. Once you have it working small you can expand on it if you're still interested, but nothing fancy is necessary for a proof of concept (there's a rough sketch of these pieces right after this list).
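
A minimal sketch of those pieces in Python, using only the standard library (untested; the start URL is a placeholder, and a plain set stands in for the database):

# Fetch a page, parse out its links, and remember where you've been so you
# don't go in circles. A set stands in for a real database here.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    visited = set()
    frontier = [start_url]
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue                                # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)           # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return visited

crawl("http://www.example.com/")                    # placeholder start page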

Those are some pretty basic goals to get started with. Don't try to roll your own, because each of these can easily take weeks to do fully. If you can find libraries to do it for you, you'll save a lot of time. It's not that you couldn't write it all yourself. It's that there's no real value in doing it again, since many others have already spent the time and money to do a better job than you have the time or money to do, and they've been kind/generous enough to share the fruits of their labors so you don't have to reinvent the square wheel.
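
For example, with an off-the-shelf HTML parser, link extraction is only a few lines (this assumes the third-party requests and beautifulsoup4 packages are installed):

# Link extraction with an existing parser instead of hand-rolled code.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://www.example.com/").text          # placeholder URL
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)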

There are various standard files that you should research too so that you obey them, such as /robots.txt, which describes to crawlers which resources they're welcome to index and which they should leave alone (or whether crawling is frowned upon in general).
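
Python even ships a parser for it in the standard library (sketch; the site URL and user-agent string are placeholders):

# Check robots.txt before fetching a page.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")    # placeholder site
rp.read()

if rp.can_fetch("MyCrawler/0.1", "http://www.example.com/some/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")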

You'll also probably want to narrow the scope of how far your crawler goes while you work out the bugs and figure out the netiquette rules. You won't want somebody the size of Google summoning you to court (granted, I don't think Google would be hurt by your bot, but some smaller fish might be, and they still might be big enough to eat you). ;)
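
An easy way to narrow the scope is to stick to a short list of hosts and pause between requests (sketch; the allowed hosts and the delay are placeholders):

# Keep the crawler on an allowed list of hosts and rate-limit its requests.
import time
import urllib.request
from urllib.parse import urlparse

ALLOWED_HOSTS = {"www.example.com"}     # placeholder: only hosts you have a reason to crawl
DELAY_SECONDS = 2.0                     # pause between requests

def in_scope(url):
    return urlparse(url).netloc in ALLOWED_HOSTS

def polite_fetch(url):
    time.sleep(DELAY_SECONDS)           # simple global rate limit
    return urllib.request.urlopen(url).read()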

A language such as Perl or Python would make this much easier. Not only do they have excellent libraries for these kinds of things, but they also have easy access to Unicode strings and databases and the like. Whereas if you attempt to do this in C or C++ you'll probably have to write 10x or 100x more code for the same job. And you won't need the things C or C++ are good at right away, if at all, so you might as well optimize for progress instead of performance.
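
For instance, the "where you've been and when" database from the list above is only a few lines with Python's built-in sqlite3 module (sketch; the file name is arbitrary):

# Track visited URLs and timestamps in SQLite so the crawler doesn't loop.
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect("crawler.db")
db.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY, visited_at TEXT)")

def already_visited(url):
    return db.execute("SELECT 1 FROM visited WHERE url = ?", (url,)).fetchone() is not None

def mark_visited(url):
    db.execute("INSERT OR REPLACE INTO visited (url, visited_at) VALUES (?, ?)",
               (url, datetime.now(timezone.utc).isoformat()))
    db.commit()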

Ariesnl

How about giving links that are less like the current page a higher priority?
That would reduce the load on any single server.
I already have working skeleton code in C#:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using System.Net;

namespace Crawler1
{
    class Program
    {
        static List<string> history = new List<string>();
        static List<string> urllist = new List<string>();
        static string strCurrent;
        static StreamWriter sw = new StreamWriter("Links.txt");

        static void Main(string[] args)
        {
            string page = GetPage("www.startpagina.nl/"); // <-- starting page

            bool blQuit = false;

            while (!blQuit)
            {
                if (page.Length > 0)
                {
                    // grab urls from page
                    foreach (string st in GetURLs(page))
                    {
                        urllist.Add(st);
                    }
                }

                int l = 0; // look for links that were not visited before
                foreach (string s in urllist)
                {
                    if (!history.Contains(s))
                    {
                        l++;
                        Console.WriteLine("! ==>");
                        history.Add(strCurrent);
                        urllist.Remove(strCurrent); // ToDo: make some intelligent sort, high prio links should come first
                        page = GetPage(s);
                        Console.WriteLine(s);
                        if (IsInteresting(page))
                        {
                            sw.WriteLine(s);
                            sw.Flush();
                            Console.WriteLine("[LOGGED !]");
                        }
                    }
                }

                // no more links to follow, track back
                if (l == 0)
                {
                    if (history.Count > 0)
                    {
                        page = history.Last();
                        history.RemoveAt(history.Count - 1);
                        Console.WriteLine("<== !");
                    }
                    else
                    {
                        blQuit = true;
                        Console.WriteLine("Ready...");
                    }
                }
            }

            Console.ReadKey();
        }

        static bool IsInteresting(string page)
        {
            bool blResult = false;
            String[] words = page.Split(' ');
            double count = 0;
            foreach (string word in words)
            {
                if (word.ToLower() == "auto")
                {
                    count++;
                }
            }
            if (count > 5)
            {
                blResult = true;
            }

            return blResult;
        }

        static string GetPage(string url)
        {
            string html = "";
            try
            {
                WebRequest request = WebRequest.Create(@"http:\\" + url);
                WebResponse response = request.GetResponse();
                Stream data = response.GetResponseStream();
                html = String.Empty;
                using (StreamReader sr = new StreamReader(data))
                {
                    html = sr.ReadToEnd();
                }
                strCurrent = url;
            }
            catch (Exception e)
            {
            }
            return html;
        }

        static List<string> GetURLs(string a_str)
        {
            List<string> res = new List<string>();

            string[] strs = Regex.Split(a_str, @"[\n:|<|>]");

            foreach (string ss in strs)
            {
                MatchCollection mc = Regex.Matches(ss, @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:#@%/;$()~_?\+-=\\\.&]*)");
                foreach (Match m in mc)
                {
                    res.Add(m.Value);
                }
            }
            res.Sort();
            return res;
        }

        static void Print(List<string> list)
        {
            foreach (string url in list)
            {
                Console.WriteLine(url);
            }
        }
    }
}

It works, but I can already see some bugs in it.
I was tired yesterday evening... but still, it works (sort of).
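
The "intelligent sort" ToDo could start out really simple, e.g. move links that point to a different host to the front of the queue (a sketch of the idea in Python, not wired into the C# above):

# Prefer links whose host differs from the page just crawled, so consecutive
# requests spread across servers instead of hammering one of them.
from urllib.parse import urlparse

def prioritize(links, current_url):
    current_host = urlparse(current_url).netloc
    other_hosts = [u for u in links if urlparse(u).netloc != current_host]
    same_host = [u for u in links if urlparse(u).netloc == current_host]
    return other_hosts + same_host      # cross-host links get crawled first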

GullRaDriel

Just curl it baby. 8-)
