Online Data Collection Using Python
Data Collection from Websites
Article by Wanga Harron.
One of the core skills for anyone working with data is data collection. A data scientist, statistician, data analyst, or data engineer can't do anything constructive without data, and that data has to come from somewhere, hence data collection. Data can be collected in many ways, depending on the type of data and the purpose of the collection. Here I will look at collecting data from websites using the relevant technology.
Not long ago, when organizations wanted to collect information from websites, they had to hire a group of people to do the collection. The people hired had to copy and paste a lot of pages, which was tiresome and uneconomical. Then a Python library called Beautiful Soup came along and made it easy to scrape (collect data from) a large number of web pages. Beautiful Soup also inspired the creation of an R package called rvest, which is used for web scraping. Another Python library used for scraping is Selenium, which automates a real browser and can be used to build a bot that scrapes websites. To use the packages listed above, you need to be comfortable programming in Python or R. If programming isn't your cup of tea, there are paid API tools that scrape websites for you and save the data as JSON files. JSON files are easy to work with because they can readily be converted to a pandas DataFrame for your own analysis.
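To make the JSON point concrete, here is a minimal sketch of turning JSON records into a pandas DataFrame. The JSON text, its field names, and the company names are made up for illustration; a real scraping service will return its own structure, so treat this as one possible shape rather than what any particular tool produces.

```python
import json

import pandas as pd

# Hypothetical JSON text, standing in for a file returned by a scraping service.
raw_json = """
[
  {"job_title": "Accounts Assistant", "company": "Example Bank"},
  {"job_title": "Loan Officer", "company": "Example Lender"}
]
"""

records = json.loads(raw_json)  # a list of dictionaries
df = pd.DataFrame(records)      # one row per record, one column per key
print(df)
```

If the JSON lives in a file instead of a string, json.load() on the open file gives the same list of dictionaries.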
Let's look at web scraping in Python using the Beautiful Soup package. Before we dive in, let me highlight two important points. First, some websites are protective of their data and do not allow web scraping, so make sure you read a website's privacy policy and terms of use before trying to scrape it. Second, be respectful when you scrape: don't send requests continuously, as this may affect the functionality of the website.
Let's get the ball rolling. Open Chrome or any other browser of your choice and go to the website you want to scrape. For me, I want to get all the job titles from Jobs in Kenya - Career Point Kenya, a job advertisement website in Kenya.
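Before moving on, here is one way the two cautions above might look in code. This is only a sketch: it uses Python's standard urllib.robotparser module to check a site's robots.txt rules (a common convention, not something this article's target site is confirmed to rely on) and pauses between requests. The rules, the example.com URLs, and the function names are invented for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the sketch runs offline.
# For a real site you would call parser.set_url(".../robots.txt") and parser.read().
EXAMPLE_RULES = """
User-agent: *
Disallow: /private/
""".splitlines()


def build_parser(rule_lines):
    """Create a robots.txt parser from a list of rule lines."""
    parser = RobotFileParser()
    parser.parse(rule_lines)
    return parser


def filter_allowed_urls(urls, parser, delay_seconds=1.0, user_agent="*"):
    """Return the URLs the rules allow, pausing after each allowed one."""
    allowed = []
    for url in urls:
        if parser.can_fetch(user_agent, url):
            allowed.append(url)        # a real scraper would request the page here
            time.sleep(delay_seconds)  # wait so the site is not hit continuously
    return allowed


parser = build_parser(EXAMPLE_RULES)
candidate_urls = [
    "https://example.com/jobs/",
    "https://example.com/private/data",
]
print(filter_allowed_urls(candidate_urls, parser, delay_seconds=0))
```

A robots.txt check does not replace reading the site's policy; it is simply a quick, automatic first filter.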
Here is how the page looks:
[Screenshot: the Career Point Kenya jobs page]
After opening the website, right-click the information you want to scrape, as the image below shows.
[Screenshot: right-clicking a job title]
Now click Inspect, and the layout below should appear.
[Screenshot: the browser's inspect panel]
From the screenshot we can see the HTML tags on the right side of the page. As I said earlier, I want to scrape all the job titles, and the titles sit inside <h2> tags.
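To show what "the titles sit inside <h2> tags" means in practice, here is a small sketch that pulls the <h2> text out of a made-up piece of HTML. It assumes the beautifulsoup4 package is already installed, and the HTML is a simplified stand-in, not the real markup of the jobs page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, a simplified stand-in for a job listing page.
sample_html = """
<html><body>
  <h1>Latest Jobs</h1>
  <h2>Accounts Assistant Job</h2>
  <p>Posted today</p>
  <h2>Internal Auditor Job</h2>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all("h2") returns every <h2> element; get_text() extracts its text.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(titles)
```

Only the two <h2> elements are picked up; the <h1> heading and the <p> paragraph are ignored.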
Now head over to your Python environment. Any IDE works for web scraping; I am going to use Jupyter Notebook. Create a virtual environment first. Here is the article on how to create a virtual environment in Windows: www.statsguru.com.
After creating the virtual environment, install the following packages:
Code:
!pip install pandas
!pip install requests
!pip install beautifulsoup4
Now let's import the required packages:
Code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
Now let's build a function called aronscrapper (you can give the function any name of your choice) that scrapes the website above, builds a data frame, and converts the data to CSV format, which can then be saved and opened in Excel for your own analysis.
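Code:
def aronscrapper(url):
    # Download the page and stop early on an HTTP error
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # Collect the text of every <h2> tag (the job titles on this page)
    job_title = []
    for tag in soup.find_all('h2'):
        job_title.append(tag.text)
    df = pd.DataFrame({'job_title': job_title})
    # With no file path, to_csv() returns the CSV data as a string
    csv_file = df.to_csv()
    return csv_file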
Now let's use the function above:
Code:
url='https://www.careerpointkenya.co.ke/jobs/'
my_file=aronscrapper(url)
print(my_file)
The result, my_file, is a string of CSV text with one job title per row (to_csv() returns a string when it is not given a file path).
You can see from the output below that I obtained my information with just a few lines of code.
Output:
,job_title
0,Marketing and Branding Executive Job Acen Tria Group
1,Assistant Office Administrator Job NTSA
2,Assistant Supply Chain Management Officer Job NTSA
3,Water Treatment Works Operator Job TAVEVO Water
4,Pump Attendant Job TAVEVO
5,Pipe Fitters Job TAVEVO
6,Loan Officers Job Cherehani Africa-Kakamega
7,Loan Officers Job Cherehani Africa-Busia
8,Loan Officers Job Cherehani Africa-Migori
9,Primary School Teacher Internships TSC(2000 Posts)
10,"Junior Secondary School Teacher Internships TSC(18,000 Posts)"
11,Senior HR Management Officer Job Kericho County
12,Accounts Assistant Job Co-op Bank
13,Internal Auditor Job Co-op Bank
14,Credit Officer Job Co-op Bank
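If you want an actual file that Excel can open, one option is to give to_csv() a file path. The sketch below is only an illustration: the file name job_titles.csv and the temporary folder are my own choices, and the two job titles are placeholder data rather than a fresh scrape.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"job_title": ["Accounts Assistant Job", "Internal Auditor Job"]})

# Hypothetical output location; a temporary folder keeps the sketch self-contained.
output_path = os.path.join(tempfile.mkdtemp(), "job_titles.csv")

# Passing a path writes the file to disk; index=False leaves out the 0, 1, 2... column.
df.to_csv(output_path, index=False)

# Read the file back to confirm what was written.
with open(output_path, encoding="utf-8") as handle:
    saved_text = handle.read()
print(saved_text)
```

In your own notebook you would replace output_path with a location you can find easily, such as a file name in your working folder.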
N/B: You can scrape a website of your choice by passing its URL, and use the technique I have shown to target whichever HTML tag holds the information you want.
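As a rough sketch of that idea, the function below takes the tag name as an argument instead of hard-coding 'h2', and can optionally filter by CSS class. The function name extract_text, the css_class parameter, and the sample HTML are all hypothetical; it works on HTML text you already have, so it runs without touching any website.

```python
from bs4 import BeautifulSoup


def extract_text(html, tag, css_class=None):
    """Return the text of every <tag> element, optionally filtered by class."""
    soup = BeautifulSoup(html, "html.parser")
    if css_class is None:
        matches = soup.find_all(tag)
    else:
        # class_ (with an underscore) is Beautiful Soup's keyword for the class attribute
        matches = soup.find_all(tag, class_=css_class)
    return [match.get_text(strip=True) for match in matches]


# Hypothetical HTML used only to exercise the function.
sample_html = """
<div>
  <h2 class="job">Credit Officer Job</h2>
  <h2 class="advert">Sponsored</h2>
  <span class="location">Nairobi</span>
</div>
"""

print(extract_text(sample_html, "h2"))
print(extract_text(sample_html, "h2", css_class="job"))
print(extract_text(sample_html, "span"))
```

Filtering by class is handy when a page uses the same tag for things you do not want, such as adverts mixed in with job titles.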