Online data collection using Python

Data collection from websites

Article by Wanga Harron.

Data collection is one of the core skills for anyone working with data. A data scientist, statistician, data analyst or data engineer can’t do anything constructive without data, so the data has to be found first, hence data collection. Data can be collected in many ways, depending on the type of data and the purpose of the collection. Here I will look at collecting data from websites using the relevant technology.

Not long ago, when organizations wanted to collect information from websites, they had to hire a group of people to copy and paste page after page by hand, which was tiresome and uneconomical. Python libraries such as Beautiful Soup changed that by making it easy to scrape (collect data from a website) many pages with little effort. Beautiful Soup also inspired the R package rvest, which is used for web scraping. Another Python library used for scraping is Selenium, which automates a real browser so that you can build a bot that scrapes websites.

To use the packages listed above you need to be comfortable programming in Python or R. If programming isn’t your cup of tea, there are paid API tools that scrape websites for you and save the data as JSON files. JSON files are easy to work with because they can be converted to a pandas DataFrame for your own analysis.
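To make the point about JSON concrete, here is a minimal sketch of turning JSON into a pandas DataFrame. The JSON text and its field names (title, company) are made up for illustration; a real scraping API will return its own structure.

```python
import json

import pandas as pd

# Hypothetical JSON, standing in for what a scraping API might save to a file
json_text = """
[
  {"title": "Accounts Assistant", "company": "Example Bank"},
  {"title": "Loan Officer", "company": "Example Lender"}
]
"""

# json.loads turns the JSON text into a list of Python dictionaries
records = json.loads(json_text)

# Each dictionary becomes one row; the keys become the column names
df = pd.DataFrame(records)
print(df)
```

If the JSON is nested rather than a flat list of records, it usually needs some reshaping before it fits a table this neatly.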

Let’s look at web scraping in Python using the Beautiful Soup package. Before we dive in, let me highlight two important points. First, some websites are protective of their data and do not allow web scraping, so read a website’s terms of use and privacy policy before trying to scrape it. Second, be respectful: don’t send requests continuously, because this may affect the functionality of the website.

Let’s get the ball rolling. Open Chrome or any other browser and go to the website of your choice. For me, I want to get all the job titles from Jobs in Kenya - Career Point Kenya, a job advertisement website in Kenya.
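On the earlier point about scraping respectfully, here is one possible sketch using only Python's standard library. It assumes the site publishes a robots.txt file, which is a common convention and not something this article checked for any particular website. The robots.txt text, the example.com URLs and the helper names is_allowed and paced are all made up for illustration.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def paced(urls, delay_seconds=1.0):
    """Yield URLs one at a time, pausing between them to avoid flooding the site."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        yield url


# Hypothetical robots.txt content; a real one lives at <site>/robots.txt
sample_robots = """User-agent: *
Disallow: /private/
"""

urls = ["https://example.com/jobs/", "https://example.com/private/data"]
for url in paced(urls, delay_seconds=0.5):
    print(url, is_allowed(sample_robots, url))
```

A robots.txt check does not replace reading the site's policies; it is only a quick first signal of what the owner is comfortable with.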

Here is how the page looks:

After opening the website, right-click on the information you want to scrape, as the image below shows.

Now click on Inspect and the layout below should appear.

From the screenshot we can see the HTML tags on the right side of the page. As I said earlier, I want to scrape all the job titles, and the titles sit inside <h2> tags.
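To see what targeting the <h2> tag means in code, here is a small sketch that runs Beautiful Soup on a made-up piece of HTML instead of the live site. The markup is an assumption for illustration; the real page's structure may differ.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a job listing page
sample_html = """
<html><body>
  <h1>Latest jobs</h1>
  <h2><a href="/jobs/1">Accounts Assistant Job</a></h2>
  <p>Short description of the first job.</p>
  <h2><a href="/jobs/2">Internal Auditor Job</a></h2>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag; get_text(strip=True) drops surrounding whitespace
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(titles)
```

Only the two <h2> elements are picked up; the <h1> heading and the <p> description are ignored.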

Now head over to your Python environment. Any IDE works for web scraping; I am going to use Jupyter Notebook. Create a virtual environment first; here is the article on how to create a virtual environment in Windows: www.statsguru.com.

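After creating the virtual environment, install the following packages (requests is included because the scraper below uses it to download the page):

Code:

!pip install pandas
!pip install requests
!pip install beautifulsoup4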

Now let’s import the required packages:

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

Now let’s build a function called aronscrapper (you can give the function any name of your choice) that scrapes the website above, creates a data frame and converts the data to CSV format, which can then be opened in Excel for your own analysis.

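Code:

def aronscrapper(url):
    # Download the page; stop early if the server returns an error status
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect the text of every <h2> tag, which holds the job titles on this page
    job_title = []
    for i in soup.find_all('h2'):
        job_title.append(i.text.strip())
    df = pd.DataFrame({'job_title': job_title})
    # With no file path, to_csv() returns the CSV data as a string
    csv_file = df.to_csv()
    return csv_file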

Now let’s use the function above:

Code:

url = 'https://www.careerpointkenya.co.ke/jobs/'
my_file = aronscrapper(url)
print(my_file)

The output is CSV text stored in the variable my_file; it is not yet a file on disk, because to_csv() was called without a file path.

You can see from the output below that I have obtained my information with just a few lines of code.

Output:

,job_title
0,Marketing and Branding Executive Job Acen Tria Group
1,Assistant Office Administrator Job NTSA
2,Assistant Supply Chain Management Officer Job NTSA
3,Water Treatment Works Operator Job TAVEVO Water
4,Pump Attendant Job TAVEVO
5,Pipe Fitters Job TAVEVO
6,Loan Officers Job Cherehani Africa-Kakamega
7,Loan Officers Job Cherehani Africa-Busia
8,Loan Officers Job Cherehani Africa-Migori
9,Primary School Teacher Internships TSC(2000 Posts)
10,"Junior Secondary School Teacher Internships TSC(18,000 Posts)"
11,Senior HR Management Officer Job Kericho County
12,Accounts Assistant Job Co-op Bank
13,Internal Auditor Job Co-op Bank
14,Credit Officer Job Co-op Bank
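If you want an actual file that Excel can open, one option is to pass a file path to to_csv(). The sketch below uses two titles from the output above as sample data; the file name job_titles.csv is only a suggestion.

```python
from pathlib import Path

import pandas as pd

# Two titles taken from the output above, used here as sample data
df = pd.DataFrame({"job_title": [
    "Accounts Assistant Job Co-op Bank",
    "Internal Auditor Job Co-op Bank",
]})

# Hypothetical file name; index=False leaves out the 0, 1, 2... row numbers
output_path = Path("job_titles.csv")
df.to_csv(output_path, index=False)

# Read the file back to confirm what was written
saved = pd.read_csv(output_path)
print(saved)
```

Leaving out index=False would also write the row numbers as an unnamed first column, which is what the output above shows.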

N/B: You can scrape the URL of your choice, and use the technique I have shown you to pick the HTML tag that holds the information you want from that website.
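As a rough sketch of how the same idea extends to other tags, the helper below takes the HTML and a CSS selector as inputs. The name extract_text, the sample markup, the location values and the class name "location" are all made up for illustration; on a real site you would read the tag and class names from the Inspect panel.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_text(html, selector):
    """Return a DataFrame with the text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    values = [element.get_text(strip=True) for element in soup.select(selector)]
    return pd.DataFrame({"text": values})


# Made-up HTML; in practice this would be the text of a downloaded page
sample_html = """
<div class="job">
  <h2>Credit Officer Job</h2>
  <span class="location">Nairobi</span>
</div>
<div class="job">
  <h2>Pump Attendant Job</h2>
  <span class="location">Voi</span>
</div>
"""

print(extract_text(sample_html, "h2"))
print(extract_text(sample_html, "span.location"))
```

A plain tag name such as "h2" selects every tag of that kind, while a selector like "span.location" narrows the match to tags with a given class.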

Stats for everyone.
