Practical Web Scraping–Getting Started

As part of my learning goals for 2018, I wanted to work through various books. This is part of  my work with Python.

After going through the first chapters of a few of them, I decided to start my February learning with Practical Web Scraping for Data Science, which looks at data acquisition using Python to pull data from the web. The book looked interesting, and it would also be a nice setup for two of the other books on my list (Power BI and natural language processing).

Like many people, I find lots of data on the web, but I constantly struggle to get it into a database. I find myself going through gyrations at times to get data. Even with the cool features of Power BI, getting data hasn't been as smooth as I'd like, so I thought this would be a good book.

Part 1

The first few chapters of the book cover the basics of web scraping. We learn what the term means and get a bit of a tutorial on who uses this technique, with some specific examples. We also get a basic Python tutorial, which I skimmed. I know a bit of Python already, and this was a very basic getting-started guide.

The next part of the early chapters deals with the basics of HTTP transport and how some of the networking works. This is interesting to me, though I'm not sure how much it matters for scraping. We'll see. There is some discussion of GET and the HTTP standard, so perhaps that's helpful. It's good to at least know what status codes might come back and which headers or parameters you need to use.
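As a minimal sketch of those GET basics with the requests library (the URL, query parameter, and User-Agent string here are made-up placeholders, not from the book):

```python
import requests

# Build a GET request without sending it, just to see what goes over the wire.
# The URL and parameters are hypothetical examples.
req = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "game of thrones"},          # encoded into the query string
    headers={"User-Agent": "my-scraper/0.1"},  # many sites check this header
).prepare()

print(req.url)                      # https://example.com/search?q=game+of+thrones
print(req.headers["User-Agent"])    # my-scraper/0.1
```

Sending it for real is just `requests.get(url, params=..., headers=...)`, and the response's `status_code` attribute holds the code (200, 404, and so on) that comes back.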

The third chapter starts to get code working. It opens with a discussion of HTML and how you can examine the structure of pages in your browser. This is a good reminder and a basic tutorial of the developer tools that exist in your browser, which you might want to use when building applications, especially those that scrape pages. There is also a basic CSS tutorial, which was good, as I needed a little refresher. I rarely deal with CSS, leaving that to others.
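The payoff of that CSS refresher is that the same selectors the browser's dev tools show you can drive scraping code. A small sketch, with a made-up HTML fragment standing in for a real page:

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment, just to show CSS selectors in action.
html = """
<div class="standings">
  <span class="team">Broncos</span>
  <span class="team">Raiders</span>
  <span class="score">24</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes the same CSS selectors you'd test in the browser console
teams = [tag.text for tag in soup.select("div.standings span.team")]
print(teams)  # ['Broncos', 'Raiders']
```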

The next part of the chapter introduces the BeautifulSoup library, which is built to parse text and, specifically, makes working with HTML easier. The examples use a Wikipedia Game of Thrones page, but I added some of my own, trying to translate this to a sports page. It worked OK, and I learned a few things.
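The basic BeautifulSoup pattern looks something like this. The HTML below is a made-up stand-in for a fetched Wikipedia page; normally you'd pass `requests.get(url).text` in instead (the `firstHeading` id and `infobox` class mirror real Wikipedia markup, but treat them as illustrative):

```python
from bs4 import BeautifulSoup

# A hypothetical snippet standing in for a downloaded page.
html = """
<html><body>
  <h1 id="firstHeading">Game of Thrones</h1>
  <table class="infobox">
    <tr><th>Genre</th><td>Fantasy</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element; class_ avoids the keyword clash
title = soup.find("h1", id="firstHeading").text
genre = soup.find("table", class_="infobox").find("td").text
print(title, "-", genre)  # Game of Thrones - Fantasy
```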

The chapter closes with regular expressions in BeautifulSoup and how you can search out elements and then start to copy data. It's more complex and tedious, but then again, lots of programming is tedious. Once it's working, it's amazing.
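The combination works because BeautifulSoup accepts a compiled regular expression almost anywhere it accepts a string. A small sketch with an invented fragment:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with some years buried in tags.
html = "<p>Season 1 aired in <b>2011</b>; season 8 ended in <b>2019</b>.</p>"

soup = BeautifulSoup(html, "html.parser")

# find_all() takes a regex for tag names, attributes, or (here) the tag text,
# so you can pull out only the elements whose content matches a pattern
years = soup.find_all("b", string=re.compile(r"^\d{4}$"))
print([tag.text for tag in years])  # ['2011', '2019']
```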

Experimenting

I started to work with this in Azure Notebooks as a different way of tracking some work in Python. I’ll want to store things in a file at some point, but for now, this lets me start and stop learning and keep track of where I am without worrying about files and names.

Not sure if anyone can access it (it’s marked public), but my project and notebooks are here: https://notebooks.azure.com/way0utwest/projects/web-scraping-with-python

I ran some of the early scripts, which are just getting you used to working with Python and accessing web pages. I then copied some examples from my Calibre view of the book and executed them. I even tried to experiment a bit.

One note: copying the code seems to leave some invalid characters behind for Azure Notebooks, so I ended up editing the beginning of every line to remove the offending character.

This got me the basics of working with web scraping. Now to try and grab some data from another page and see what I’ve learned.

The Voice of the DBA

Steve Jones is the editor of SQLServerCentral.com and visits a wide variety of data-related topics in his daily editorial. Steve has spent years working as a DBA and general-purpose Windows administrator, primarily working with SQL Server since it was ported from Sybase in 1990. You can follow Steve on Twitter at twitter.com/way0utwest
