Project Luther - Regex, Unicode, Python versions

For Project Luther, the second of the five projects we will do at Metis, we are to analyze data scraped from the Box Office Mojo website with tools such as BeautifulSoup (bs4) and Selenium. We’re expected to use pandas and to predict something useful from the data (I’m still brainstorming what I want to predict) using linear regression (statsmodels only; no scikit-learn yet).

I am still in the planning stage of my project, trying to get some data out of a single website, and have already run into an issue with bs4 and Unicode … I don’t like working with Unicode text. I’ve run into Unicode issues before, when working on a Twitter sentiment analysis project. Should I just create a conda environment with Python 3.5 in it, since Python 3 strings are Unicode by default?
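To make the Unicode question concrete, here is a minimal sketch of how scraped text could be cleaned up in Python 3, assuming the page is UTF-8 encoded. The helper names `to_text` and `normalize_title` and the sample title are made up for illustration; they are not part of bs4 or of the project.

```python
import unicodedata


def to_text(raw, encoding="utf-8"):
    """Return a str whether the input is bytes or already str.

    The UTF-8 default is an assumption about the scraped page.
    """
    if isinstance(raw, bytes):
        # Undecodable bytes become U+FFFD instead of raising an error.
        return raw.decode(encoding, errors="replace")
    return raw


def normalize_title(raw):
    """Decode, apply NFKC normalization, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", to_text(raw))
    return " ".join(text.split())


# Hypothetical scraped bytes: an accented letter plus a non-breaking space.
print(normalize_title(b"Am\xc3\xa9lie\xc2\xa0(2001)"))
```

The idea is to decode bytes once, at the edge of the program, and work with `str` everywhere after that.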

So much to do, so little time! I’m still working on finishing the Benson challenges.

Here are some useful links to keep handy for this project:

regex:
- RegExr
- Pythex
- Tutorial
html/css:
- Tutorial
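
As an example of the kind of pattern the regex testers above help with, here is a small sketch that pulls a dollar amount out of scraped text. It assumes gross figures appear as a dollar sign followed by comma-separated digits; the sample string and the `parse_dollars` name are hypothetical.

```python
import re

# A dollar sign, optional spaces, then digits with optional thousands commas.
MONEY_RE = re.compile(r"\$\s*(\d[\d,]*)")


def parse_dollars(text):
    """Return the first dollar amount in text as an int, or None if absent."""
    match = MONEY_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_dollars("Domestic Total Gross: $936,662,225"))
```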

I want to find a tool that lets me visualize the siblings, parents, etc. of HTML tags. Such a thing must exist (and if not, it needs to be created!). I will keep googling.
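
In the meantime, a rough stand-in can be sketched with the standard library's `html.parser`: print each tag indented by its depth, so that siblings line up and parents sit one level to the left. This is only a sketch that assumes reasonably well-formed HTML; `TagTreePrinter`, `tag_tree`, and the sample markup are names and data invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagTreePrinter(HTMLParser):
    """Collect one indented line per opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def tag_tree(html):
    """Return the tag hierarchy of html as an indented outline."""
    printer = TagTreePrinter()
    printer.feed(html)
    printer.close()
    return "\n".join(printer.lines)


sample = "<table><tr><td>Title</td><td>Gross</td></tr></table>"
print(tag_tree(sample))
```

In the printed outline, the two `td` lines share an indent (siblings) under `tr` (their parent). Badly nested or unclosed tags will throw the indentation off, since the sketch does no error recovery.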

Written on September 27, 2016