Online data collection using Python

 


Data collection from websites

Article by Wanga Harron.



Data collection is one of the core skills for anyone working with data. A data scientist, statistician, data analyst or data engineer can’t do anything constructive without data, so that data must first be found and collected. Data can be collected in many ways, depending on the type of data and the purpose of the collection. Here I will look at collecting data from websites.

Not long ago, when organizations wanted to collect information from websites, they had to hire a group of people to copy and paste page after page, which was tiresome and uneconomical. Python libraries such as Beautiful Soup changed this by making it easy to scrape (collect data from) many pages of a website. Beautiful Soup was also one of the inspirations for rvest, an R package used for web scraping. Another tool used for scraping from Python is Selenium, which automates a real browser and can be used to build a bot that scrapes websites. To use these packages you need to be comfortable programming in Python (for Beautiful Soup and Selenium) or R (for rvest). If programming isn’t your cup of tea, there are paid scraping tools and APIs that scrape websites for you and save the data as JSON files. JSON files are easy to work with, as they can readily be converted to a pandas data frame for your own analysis.
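As a small illustration of that last point, here is a minimal sketch of turning JSON text into a pandas data frame. The records and field names below are made up for the example; a real scraping tool will return its own structure.

```python
import json

import pandas as pd

# Hypothetical JSON text, standing in for what a scraping tool might save.
json_text = """
[
  {"job_title": "Accounts Assistant", "company": "Example Bank"},
  {"job_title": "Internal Auditor", "company": "Example Bank"}
]
"""

# json.loads turns the text into a list of dictionaries ...
records = json.loads(json_text)

# ... and pandas builds one row per dictionary.
df = pd.DataFrame(records)
print(df)
```

If the JSON is nested rather than a flat list of records, it usually needs some reshaping before it fits a data frame this neatly.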

 

Let’s look at web scraping in Python using the Beautiful Soup package. Before we dive in, let me highlight two important points. First, some websites are protective of their data and do not allow web scraping, so make sure you read a website’s terms of use and privacy policy (and its robots.txt file, if it has one) before trying to scrape it. Second, be respectful: don’t scrape continuously, as this may affect how the website works for other users.

Now let’s get the ball rolling. Open Chrome or any other browser of your choice and go to the website you want to scrape. In my case, I want to get all the job titles from Jobs in Kenya - Career Point Kenya, a job advertisement website in Kenya.
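On the point about respecting a website’s rules: many sites publish a robots.txt file that says which paths automated programs may visit. The sketch below shows one way to read such rules with Python’s standard library. The robots.txt content, the example.com addresses and the program name my-scraper are all invented for illustration, and this check does not replace reading the site’s own policies.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for a hypothetical site (example.com).
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() takes the file's lines; for a live site you would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
parser.parse(robots_txt.splitlines())

# "my-scraper" is a placeholder name for your own program.
allowed_jobs = parser.can_fetch("my-scraper", "https://example.com/jobs/")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed_jobs)     # True: /jobs/ is not disallowed
print(allowed_private)  # False: /private/ is disallowed
print(delay)            # 5: seconds the site asks you to wait between requests
```

If a delay is given, calling time.sleep(delay) between requests is one simple way to honour it.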

 

With the page open, right-click on the information you want to scrape, then click Inspect. The browser’s developer tools should open, showing the page’s HTML.



 

In the inspector we can see the HTML tags on the right side of the page. As I said earlier, I want to scrape all the job titles, and the titles lie under the <h2> tag.

Now head over to your Python environment. Any IDE works for web scraping; I am going to use Jupyter Notebook. Create a virtual environment first; here is an article on how to create a virtual environment in Windows: www.statsguru.com.

After creating the virtual environment, install the following packages (requests is included because the code below imports it):

Code:

!pip install pandas
!pip install beautifulsoup4
!pip install requests

Now let’s import the required packages:

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

 

Now let’s build a function called aronscrapper (you can give the function any name of your choice) that scrapes the website above, puts the job titles in a data frame and converts the data to CSV format, which can then be saved and opened in Excel for your own analysis.

Code:

 

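def aronscrapper(url):
    # Download the page and stop early if the server reports an error
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Parse the HTML with Python's built-in parser
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect the text of every <h2> tag (the job titles on this page)
    job_title = []
    for i in soup.find_all('h2'):
        job_title.append(i.text)
    # Build a data frame; to_csv() with no path returns the CSV as text
    df = pd.DataFrame({'job_title': job_title})
    csv_file = df.to_csv()
    return csv_file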

Now let’s use the function above:

Code:

url = 'https://www.careerpointkenya.co.ke/jobs/'
my_file = aronscrapper(url)
print(my_file)

   

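The output is CSV text stored in the variable my_file. To get an actual file that Excel can open, pass a file name to to_csv inside the function, for example df.to_csv('job_titles.csv') (any file name will do).

As you can see from the output below, I have obtained my information with just a few lines of code.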

Output:

,job_title
0,Marketing and Branding Executive Job Acen Tria Group
1,Assistant Office Administrator Job NTSA
2,Assistant Supply Chain Management Officer Job NTSA
3,Water Treatment Works Operator Job TAVEVO Water
4,Pump Attendant Job TAVEVO
5,Pipe Fitters Job TAVEVO
6,Loan Officers Job Cherehani Africa-Kakamega
7,Loan Officers Job Cherehani Africa-Busia
8,Loan Officers Job Cherehani Africa-Migori
9,Primary School Teacher Internships TSC(2000 Posts)
10,"Junior Secondary School Teacher Internships TSC(18,000 Posts)"
11,Senior HR Management Officer Job Kericho County
12,Accounts Assistant Job Co-op Bank
13,Internal Auditor Job Co-op Bank
14,Credit Officer Job Co-op Bank
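One detail worth knowing about to_csv: called without a path it returns the CSV as text, and called with a path it writes a file. The sketch below shows the difference using two of the titles from the output above. The file name job_titles.csv is only an example.

```python
import pandas as pd

df = pd.DataFrame({
    "job_title": [
        "Accounts Assistant Job Co-op Bank",
        "Internal Auditor Job Co-op Bank",
    ]
})

# With no path, to_csv returns the CSV as a string.
csv_text = df.to_csv()

# With a path, it writes a file instead. index=False leaves out the
# row-number column, which is usually not needed in Excel.
df.to_csv("job_titles.csv", index=False)

# Read the file back to confirm what was saved.
check = pd.read_csv("job_titles.csv")
print(check)
```

The saved file lands in the current working directory, which in Jupyter Notebook is normally the folder the notebook was started from.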

 

 

N.B. You can scrape any URL of your choice with this technique: inspect the page to find the HTML tag that holds the information you want, then use that tag in place of 'h2' in the function.
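To make the tag easy to swap, one option is to pass it in as a parameter. The sketch below is one possible way to do that, not part of the function above: extract_text is a hypothetical helper, and the HTML is made up so the example runs without visiting any website.

```python
from bs4 import BeautifulSoup


def extract_text(html, tag):
    """Return the text of every `tag` element found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.find_all(tag)]


# Made-up HTML standing in for a downloaded page.
sample_html = """
<h1>Latest jobs</h1>
<h2><a href="/job-1">Accounts Assistant Job</a></h2>
<p>Nairobi</p>
<h2><a href="/job-2">Internal Auditor Job</a></h2>
<p>Mombasa</p>
"""

titles = extract_text(sample_html, "h2")
locations = extract_text(sample_html, "p")
print(titles)     # ['Accounts Assistant Job', 'Internal Auditor Job']
print(locations)  # ['Nairobi', 'Mombasa']
```

For a live page, the html argument would be the text of the response returned by requests.get(url).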
