Web scraping is an essential skill for extracting data from websites. Scraping and parsing a table can be very tedious if we use the standard Beautiful Soup parser to do it. Therefore, here we will describe a library with whose help any table can be scraped from any website easily. With this method you don't even have to inspect the elements of the web page; you only have to provide its URL. That's it, and the work will be done within seconds.
Installation
You can use pip to install this library:
pip install html-table-parser-python3
Getting started
Step 1: Import the libraries required for the task
# Library for opening URLs and creating requests
import urllib.request

# pretty-print python data structures
from pprint import pprint

# for parsing all the tables present on the website
from html_table_parser.parser import HTMLTableParser

# for converting the parsed data into a pandas DataFrame
import pandas as pd
Step 2: Define a function to get the contents of the web page
# Opens a website and reads its
# binary contents (HTTP Response Body)
def url_get_contents(url):

    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # reading contents of the website
    return f.read()
Now that our function is ready, we have to specify the URL of the website whose table we need to parse.
Note: Here we will take the example of moneycontrol.com since it has many tables and will give you a better understanding. You can view the website here.
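One practical caveat: some servers reject requests carrying urllib's default User-Agent and answer with HTTP 403. The variant below is a sketch, not part of the original snippet; the browser-like header value is an assumption about what such servers accept.

```python
import urllib.request

def url_get_contents(url):
    # Send a browser-like User-Agent header; some servers refuse
    # urllib's default "Python-urllib/3.x" agent string.
    req = urllib.request.Request(
        url=url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as f:
        # Return the raw bytes of the HTTP response body
        return f.read()
```

Using `with` also guarantees the connection is closed once the body has been read.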
Step 3: Parsing the table
# defining the html contents of a URL
xhtml = url_get_contents('Link').decode('utf-8')

# Defining the HTMLTableParser object
p = HTMLTableParser()

# feeding the html contents in the HTMLTableParser object
p.feed(xhtml)

# Now finally obtaining the data of the table required
pprint(p.tables[1])
Each row of the table is stored in a list. This can easily be converted into a pandas DataFrame and used to perform any analysis.
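As a minimal sketch of that conversion — the table below is made-up data in the shape html_table_parser returns (a list of rows, each row a list of cell strings):

```python
import pandas as pd

# Hypothetical parsed table in html_table_parser's output shape:
# a list of rows, each row a list of cell strings.
table = [
    ['Company', 'Price', 'Change'],
    ['ABC Ltd', '120.50', '+1.20'],
    ['XYZ Corp', '98.00', '-0.40'],
]

# Treat the first row as the header and the rest as data.
df = pd.DataFrame(table[1:], columns=table[0])

# Cell values arrive as strings; convert numeric columns for analysis.
df['Price'] = pd.to_numeric(df['Price'])
print(df)
```

Whether row 0 really is a header depends on the scraped table, so check the parsed output before assuming it.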
Complete code
Python3
# Library for opening URLs and creating requests
import urllib.request

# pretty-print python data structures
from pprint import pprint

# for parsing all the tables present on the website
from html_table_parser.parser import HTMLTableParser

# for converting the parsed data into a pandas DataFrame
import pandas as pd


# Opens a website and reads its
# binary contents (HTTP Response Body)
def url_get_contents(url):

    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # reading contents of the website
    return f.read()


# defining the html contents of a URL
xhtml = url_get_contents('Link').decode('utf-8')

# Defining the HTMLTableParser object
p = HTMLTableParser()

# feeding the html contents in the HTMLTableParser object
p.feed(xhtml)

# Now finally obtaining the data of the table required
pprint(p.tables[1])

# converting the parsed data into a pandas DataFrame
print(pd.DataFrame(p.tables[1]))
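Once a parsed table sits in a DataFrame, saving it takes one line. The sketch below uses made-up rows and serializes to CSV text; passing a file path to `to_csv` instead would write a file.

```python
import pandas as pd

# Hypothetical rows shaped like html_table_parser's output.
table = [['Name', 'Value'], ['A', '1'], ['B', '2']]
df = pd.DataFrame(table[1:], columns=table[0])

# Serialize without the index column.
csv_text = df.to_csv(index=False)
print(csv_text)
```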