Scraping a table from a website with Python

Web scraping is an essential skill for pulling data out of any website. Scraping and parsing a table can be very tedious work if we use the standard Beautiful Soup parser to do it. Therefore, here we describe a library with whose help any table can be scraped from any website easily. With this method you do not even have to inspect the elements of the website; you just have to provide its URL. That's it, and the work will be done within seconds.

Installation

You can use pip to install this library:

pip install html-table-parser-python3

Getting started

Step 1: Import the libraries required for the task

# Library for opening url and creating 
# requests
import urllib.request

# pretty-print python data structures
from pprint import pprint

# for parsing all the tables present 
# on the website
from html_table_parser.parser import HTMLTableParser

# for converting the parsed data into a
# pandas dataframe
import pandas as pd

Step 2: Define a function to get the contents of the website

# Opens a website and reads its
# binary contents (HTTP Response Body)
def url_get_contents(url):

    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # reading contents of the website
    return f.read()

Now that our function is ready, we have to specify the URL of the website whose tables we need to parse.

Note: Here we will take the example of moneycontrol.com, since it has many tables and will give you a better understanding. You can view the website here.
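As a minimal sketch of how the function might be called: the URL below is a hypothetical placeholder, and the User-Agent header is an assumption added because some websites reject urllib's default client.

# Hypothetical usage of url_get_contents. The URL is a
# placeholder; the browser-like User-Agent is an assumed
# workaround for sites that block urllib's default one.
url = 'https://www.example.com/markets'  # placeholder URL
req = urllib.request.Request(
    url=url,
    headers={'User-Agent': 'Mozilla/5.0'}  # assumed workaround
)
xhtml = urllib.request.urlopen(req).read().decode('utf-8')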

Step 3: Parsing the tables

# defining the HTML contents of a URL
xhtml = url_get_contents('Link').decode('utf-8')

# Defining the HTMLTableParser object
p = HTMLTableParser()

# feeding the HTML contents into the
# HTMLTableParser object
p.feed(xhtml)

# Now finally obtaining the data of
# the table required
pprint(p.tables[1])

Each row of the table is stored in an array. This can easily be converted into a pandas dataframe and used to perform any analysis.
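For example, a minimal sketch of that conversion, assuming the parsing code above has already run so that p.tables[1] holds the rows, and assuming (about this particular table's layout) that the first row holds the column headers:

# converting the parsed rows of one table into a
# pandas dataframe; the first row is promoted to
# column headers (an assumption about this table)
rows = p.tables[1]
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.head())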

Complete code

Python3




# Library for opening url and creating
# requests
import urllib.request

# pretty-print python data structures
from pprint import pprint

# for parsing all the tables present
# on the website
from html_table_parser.parser import HTMLTableParser

# for converting the parsed data into a
# pandas dataframe
import pandas as pd


# Opens a website and reads its
# binary contents (HTTP Response Body)
def url_get_contents(url):

    # making request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # reading contents of the website
    return f.read()


# defining the HTML contents of a URL
xhtml = url_get_contents('Link').decode('utf-8')

# Defining the HTMLTableParser object
p = HTMLTableParser()

# feeding the HTML contents into the
# HTMLTableParser object
p.feed(xhtml)

# Now finally obtaining the data of
# the table required
pprint(p.tables[1])

How do I scrape a table using JavaScript?

Web scraping with JavaScript and Node.js.
Prepare our file.
Inspect the target page using DevTools.
Send our HTTP request and parse the raw HTML.
Loop through the rows of the HTML table.
Push the scraped data into an empty array.
Write the scraped data to a CSV file.
HTML table scraper (full code).

Is JS good for web scraping?

You can use JavaScript for web scraping if you want to scrape websites that require a lot of JavaScript to work correctly. To scrape such websites, you will need to use what is called a "headless browser", meaning a real web browser that fetches and renders the website for you.
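Since this article's examples are in Python, here is a minimal sketch of the headless-browser idea using Selenium with headless Chrome and pandas; Selenium, a Chrome driver, and the placeholder URL are all assumptions, as the article itself does not use them.

# A hedged sketch: render a JavaScript-heavy page in a
# headless Chrome browser via Selenium, then hand the
# rendered HTML to pandas to extract any tables.
# Assumes: pip install selenium pandas lxml, and Chrome installed.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import pandas as pd

options = Options()
options.add_argument('--headless=new')  # run without a visible window
driver = webdriver.Chrome(options=options)

driver.get('https://www.example.com/tables')  # placeholder URL
html = driver.page_source  # HTML after JavaScript has run
driver.quit()

# pd.read_html parses every <table> in the rendered HTML
tables = pd.read_html(html)
print(tables[0].head())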

Is web scraping with Python legal?

Do not share downloaded content illegally. Scraping data for personal use is generally acceptable, even if it is copyrighted information, as it may fall under the fair-use provisions of intellectual property law. However, sharing data you do not have the right to share is illegal. Share only what you are allowed to.

Is web scraping against TOS?

Good news for archivists, academics, researchers, and journalists: scraping publicly accessible data is legal, according to a U.S. appeals court ruling.
