Capturing web pages

  • We currently have a person who visits hundreds of web pages each month. He saves the source of each page to a file and then runs an Excel VBA macro to extract a few pieces of information from each page. This information is then used for various reports.

    It would be a lot cheaper/quicker to have a PC/server visit these pages than a person. We want to know whether we can automate capturing the data from these web pages.

    We have seen a demo of a product which would automate this but at $150,000 it is way out of our price bracket - budget for new software is zero.

    There is a file of URLs and we need to take each one, pump it into IE and then save the source of the web page that comes back.

    The tools we have are Office, VB (Visual Studio at a push) and SQL Server.

    Does anyone know of a means to do this?

    Thanks.

    Jeremy

  • I guess that's what spiders or crawlers do all the time.

    I am inclined to suggest you take a look at some hacking sites for relevant source code. Of course, strictly for educational purposes.

    Btw, how's the payment for this job?

    Frank

    http://www.insidesql.de

    http://www.familienzirkus.de

    --
    Frank Kalis
    Microsoft SQL Server MVP
    Webmaster: http://www.insidesql.org/blogs
    My blog: http://www.insidesql.org/blogs/frankkalis/

  • It's been a looong time, but as I recall it is possible. Take a look on MSDN; you might try searching for 'screen scrape'.

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • You could also invoke the XMLHTTP object through SQL.

    Using this, you can write the retrieved source straight into a SQL table.

    
    

    Set NoCount On

    Declare
        @vRootURL   Varchar(100),
        @vFileName  Varchar(20),
        @vChannelID Int,
        @oHTTP      Int,
        @vErr       Int,
        @vDLength   Int,
        @vCurLength Int,
        @vStart     Int,
        @vEnd       Int,
        @vFinalURL  Varchar(500)

    Create Table #T (Data Text)
    Create Table #T2 (Ident Int Not Null Identity(1, 1), Data Char(4))

    Set @vRootURL = 'http://localhost/'
    Set @vFileName = 'guide2.htm'

    Set @vFinalURL = @vRootURL + @vFileName

    -- Create the XMLHTTP object, fetch the page synchronously (last arg 0 = async off)
    -- and dump the response text into #T via Insert ... Exec
    Exec @vErr = sp_OACreate 'Msxml2.XMLHTTP.3.0', @oHTTP OUT
    Exec @vErr = sp_OAMethod @oHTTP, 'Open', NULL, 'GET', @vFinalURL, 0
    Exec @vErr = sp_OAMethod @oHTTP, 'Send'
    Insert Into #T Exec @vErr = sp_OAMethod @oHTTP, 'responseText' --, @vResponse OUTPUT
    Exec sp_OADestroy @oHTTP

    Select * From #T

    Drop Table #T
    Drop Table #T2

    This thread also explains it and has more error handling in the source. (The code above I am still working on for something else.)

    http://www.sqlservercentral.com/forum/topic.asp?TOPIC_ID=11078&FORUM_ID=9&CAT_ID=1&Topic_Title=dead+links&Forum_Title=General

    Cheers,

    Crispin

    "I can't die, there are too many people who still have to meet me! It's not a bug, SQL just misunderstood me!"

  • Should mention:

    Exec @vErr = sp_OAMethod @oHTTP, 'responseText', @vResponse OUTPUT

    If the final param is supplied, sp_OAMethod returns the result into this variable.

    If not, it returns it as a record.

    I find the latter the better method (most of the time), because occasionally the result is longer than 8,000 characters (the varchar(8000) limit).

    Cheers,

    Crispin

  • You may want to try "lynx" and "grep" for the required information if you have access to some kind of Unix box to spare.

    infashes

  • Yes, this is called screen scraping. I have done this a couple of times in VB and it is fairly straightforward. Similar to Crispin's approach, but in VB I used Microsoft's Inet control to GET the URL as a string. Then you can use any number of methods to extract just the data you want. I assume you are visiting many different sites, each of which formats the data differently. You could store the regexp pattern needed for each site in a table with the URL, and then your VB screen scraper could loop through each URL extracting just what is needed (see the sketch below). You said hundreds of sites? You should be able to have a routine process these within an hour, maybe.
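
    Quick untested sketch of what I mean. I've used MSXML2.XMLHTTP here rather than the Inet control, and a worksheet range instead of a SQL table, just to keep it self-contained; column A holds the URL, column B a pattern whose first capture group is the value you want.

    ' Loop through a list of URLs, fetch each page and pull out one value per site
    Sub ScrapeAll()
        Dim oHTTP As Object, oRegEx As Object, oMatches As Object
        Dim rCell As Range

        Set oHTTP = CreateObject("MSXML2.XMLHTTP")
        Set oRegEx = CreateObject("VBScript.RegExp")

        For Each rCell In Worksheets("URLs").Range("A1:A500").Cells
            If Len(rCell.Value) = 0 Then Exit For

            ' Fetch the page source synchronously
            oHTTP.Open "GET", rCell.Value, False
            oHTTP.Send

            ' Apply the per-site pattern stored next to the URL
            oRegEx.Pattern = rCell.Offset(0, 1).Value
            oRegEx.Global = False
            Set oMatches = oRegEx.Execute(oHTTP.responseText)

            If oMatches.Count > 0 Then
                rCell.Offset(0, 2).Value = oMatches(0).SubMatches(0)
            Else
                rCell.Offset(0, 2).Value = "no match"
            End If
        Next rCell
    End Sub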

    Too bad your budget is zero. I still have my libraries for this sort of thing and could put something together inexpensively, certainly not for $150,000!!!! That's just a silly number.

    dave@lonecrow.net

    Dave

    Trainmark.com IT Training B2B Marketplace
    (Jobs for IT Instructors)

  • Thanks for all your replies.

    I agree that $150,000 is an obscene amount of money (unless someone wants to pay me that for a job) and it does include a lot of bells and whistles which we don't need (don't all packages include stuff you don't need?).

    My costs come out of a different budget from software expenditure, so there is no problem with me getting paid.

    Jeremy

  • I have VBA code which performs HTTP GET and POST operations by calling the WinInet dll. You could run it directly from Excel and integrate it with your scraper macros. Let me know if you want it and I will email it to you.
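
    The core of it looks something like this (a rough sketch from memory, not the exact code on offer; the declares are the standard WinInet ones from the old VB samples, and the agent string, buffer size and INTERNET_FLAG_RELOAD constant are just what I tend to use):

    ' Standard WinInet declares for a simple synchronous HTTP GET from VBA
    Private Declare Function InternetOpen Lib "wininet.dll" Alias "InternetOpenA" _
        (ByVal sAgent As String, ByVal lAccessType As Long, ByVal sProxyName As String, _
         ByVal sProxyBypass As String, ByVal lFlags As Long) As Long
    Private Declare Function InternetOpenUrl Lib "wininet.dll" Alias "InternetOpenUrlA" _
        (ByVal hOpen As Long, ByVal sUrl As String, ByVal sHeaders As String, _
         ByVal lHeadersLength As Long, ByVal lFlags As Long, ByVal lContext As Long) As Long
    Private Declare Function InternetReadFile Lib "wininet.dll" _
        (ByVal hFile As Long, ByVal sBuffer As String, ByVal lNumBytesToRead As Long, _
         lNumberOfBytesRead As Long) As Integer
    Private Declare Function InternetCloseHandle Lib "wininet.dll" (ByVal hInet As Long) As Integer

    ' Return the raw source of a page as a string
    Public Function GetPageSource(ByVal sUrl As String) As String
        Dim hOpen As Long, hFile As Long
        Dim sBuffer As String * 4096
        Dim lBytesRead As Long

        hOpen = InternetOpen("Excel VBA", 0, vbNullString, vbNullString, 0)
        hFile = InternetOpenUrl(hOpen, sUrl, vbNullString, 0, &H80000000, 0) ' INTERNET_FLAG_RELOAD

        ' Read the response in 4K chunks until nothing is left
        Do
            InternetReadFile hFile, sBuffer, 4096, lBytesRead
            GetPageSource = GetPageSource & Left$(sBuffer, lBytesRead)
        Loop While lBytesRead > 0

        InternetCloseHandle hFile
        InternetCloseHandle hOpen
    End Function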

  • Planet115,

    That would be great if you could email me what you have got.

    Click on my name in the left-hand column to see my email address.

    Thanks

    Jeremy

  • Have you ever looked at Excel's Web Query utilities? They allow you to pull data from web pages directly into a spreadsheet, and can be set up and manipulated via the GUI or VBA.
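
    Something like this (recorded-macro style sketch, so adjust the sheet and range to taste; the URL is just a placeholder):

    ' Pull a page into the active sheet with a Web Query
    Sub PullPage()
        With ActiveSheet.QueryTables.Add( _
                Connection:="URL;http://localhost/guide2.htm", _
                Destination:=ActiveSheet.Range("A1"))
            .WebSelectionType = xlEntirePage   ' grab the whole page rather than one table
            .Refresh BackgroundQuery:=False    ' wait for the data before continuing
        End With
    End Sub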

  • I've tried several tools that can automate gathering data from the Internet, but each does things differently and requires you to learn to use it efficiently:

    - iOpus Internet Macros (runs from within IE; can log you into a site and grab tables from a web page as a delimited list to a file that you can later import into a database)

    - Cleopie (a kind of IE with a built-in scripting language; can also navigate through the web and grab data as text)

    - AutoMate 5.5 (more expensive than the other two; doesn't look at web pages as tables, but has sophisticated general automation capabilities between programs. You could, for example, automate browsing and copying data to Excel, and run a VBA macro with it).

    Hope it comes in handy for somebody.

  • It is not very hard to do if you know the commands to send to the web server. In one of my grad classes we had to do something like this using telnet (to get to know the way the servers operate). You can actually automate this with a VB program.

    To get a page, you perform a GET command against the web server. You can do this:

    GET /index.html HTTP/1.0

    where /index.html is the path/filename of the page on the server you want to retrieve. Of course, you would have to capture all of the info and filter it. I don't think it would be that hard to do.

    Joe Johnson

    NETDIO, LLC.

  • What exactly are you trying to get out of those web pages? Are you looking for pricing or something of the sort, so that you can do price comparisons? You know, most sites hate it if you hit them 100,000 times in a 10-minute period just because you want to get a little piece of information. Once they notice or capture your IP address they will simply block it, or have a chat with your Internet provider about your behavior. If there is something they could do for you, they probably will, and will work with you.

    mom

  • Back on the hacker end of things...

    Try searching for something called Black Widow. It's sort of a multi-purpose piece of software, and it might be free. It can map web sites, download entire web sites, or grab just specific types of files, links or email addresses.

    (I heard about this from a friend... Really! I've never used it! Really!)

    "I met Larry Niven at ConClave 27...AND I fixed his computer. How cool is that?"

    (Memoirs of a geek)


    "I met Larry Niven at ConClave 27...AND I fixed his computer. How cool is that?"
    (Memoirs of a geek)
