Mikhail Sisin is a co-founder of Diggernaut, a cloud service for data collection and website scraping. He has worked in data collection and analysis, and in the development of artificial intelligence and machine learning systems, for more than ten years.

Scraping product data from the Ben Meadows store

Ben Meadows was founded in 1956 with a single goal: to supply foresters with quality products for their professional work. The company soon added surveying and firefighting equipment to its range, as well as equipment for cartographers. For more than 60 years, Ben Meadows has supplied professionals with the best equipment and provided excellent service. With this scraper you can extract product information from the benmeadows.com online store.

Approximate number of products: 4000
Approximate number of requests: 4000
Recommended subscription plan: Free

NOTE: The number of requests may exceed the number of products, because data such as variations and images may be fetched through requests to additional resources. Some product data may also be delivered via XHR requests, which further increases the total number of requests needed.

To use this scraper, you need an account with our Diggernaut service.

  1. Register for an account with the Diggernaut service.
  2. After registering and confirming your email address, log in to your account.
  3. Create a project with any name and description (see our documentation if you are unsure how).
  4. Open the newly created project and create a digger in it with any name (see our documentation if you are unsure how).
  5. Copy the digger scenario below to the clipboard and paste it into the digger you created (see our documentation if you are unsure how).
  6. Switch the digger's mode from Debug to Active (see our documentation if you are unsure how).
  7. Run your digger and wait for it to finish (see our documentation if you are unsure how).
  8. Download the collected dataset in the format you need (see our documentation if you are unsure how).

Later, you can set up a schedule to run your scraper and collect the data regularly.

Scraper scenario:

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: http://www.benmeadows.com
    do:
    - find:
        path: 'ul#topnav>li:has(a#productCategories)>div.subMenu a'
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
- walk:
    to: links
    pool: catalog
    do:
    - find: 
        path: .viewPaginationNext 
        do: 
        - parse:
            attr: href
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
            
    - find:
        path: 'a#hlNavigation'
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: catalog
    - find:
        path: 'a#hladd'
        do:
        - parse:
            attr: href
        - space_dedupe
        - trim
        - if:
            match: \w+
            do:
            - normalize:
                routine: url
            - link_add:
                pool: pages
- walk:
    to: links
    pool: pages
    do:
    - sleep: 2
    - find:
        path: 'div#prodWrap'
        do:
        - object_new: product
        - eval:
            routine: js
            body: '(function (){var d = new Date(); return d.toISOString()})();'
        - object_field_set:
            object: product
            field: date
        - static_get: url
        - object_field_set:
            object: product
            field: url
        - find:
            path: meta[itemprop="identifier"]
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - if:
                match: \d+
                do:
                - object_field_set:
                    object: product
                    field: sku
        - find:
            path: 'span#lblGroupTitle'
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                field: name
        - find:
            path: 'a#imgLink'
            do:
            - parse:
                attr: href
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - normalize:
                    routine: url
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: images
        - find:
            path: script:contains('loadProductPageDropDowns')
            do:
            - parse:
                filter: loadProductPageDropDowns\((.+)\)\;\$\('\#txtHeaderSearch'\)\.focus\(\)\;\}\)\;
            - normalize:
                routine: json2xml
            - to_block
            - find:
                path: body_safe>groupname
                do:
                - parse
                - space_dedupe
                - trim
                - object_field_set:
                    object: product
                    field: name
            - find:
                path: body_safe>groupid
                do:
                - parse
                - space_dedupe
                - trim
                - if:
                    match: \d+
                    do:
                    - object_field_set:
                        object: product
                        field: sku
            - find:
                path: largeimage,secimages>large
                do:
                - parse
                - space_dedupe
                - trim
                - if:
                    match: \w+
                    do:
                    - normalize:
                        routine: url
                    - object_field_set:
                        object: product
                        joinby: "|"
                        field: images
            - find:
                path: properties>children
                do:
                - variable_clear: sort
                - variable_clear: value
                - find:
                    path: sort
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: sort
                - find:
                    path: value
                    do:
                    - parse
                    - space_dedupe
                    - trim
                    - variable_set: value
                - variable_get: sort
                - if:
                    match: \w+
                    do:
                    - variable_get: value
                    - if:
                        match: \w+
                        do:
                        - register_set: "<%sort%>: <%value%>"
                        - object_field_set:
                            object: product
                            joinby: "|"
                            field: variations
        - register_set: Ben Meadows
        - variable_set: brand
        - find:
            path: 'img[itemprop="brand"]'
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - variable_set: brand
        - variable_get: brand
        - object_field_set:
            object: product
            field: brand
        - find:
            path: span.currentCrumb>a
            slice: 0:-2
            do:
            - parse
            - space_dedupe
            - trim
            - if:
                match: \w+
                do:
                - object_field_set:
                    object: product
                    joinby: "|"
                    field: category
        - find:
            in: doc
            path: meta[name="description"]
            do:
            - parse:
                attr: content
            - space_dedupe
            - trim
            - variable_set: desc
        - find:
            path: 'div#prodDetailedBenefit>div.proDesc'
            do:
            - parse
            - space_dedupe
            - trim
            - variable_set: desc
        - variable_get: desc
        - object_field_set:
            object: product
            field: description
        - find:
            path: meta[itemprop="price"],meta[itemprop="lowPrice"]
            do:
            - parse:
                attr: content
                filter: ([0-9\.\,]+)
            - normalize:
                routine: replace_substring
                args:
                    \,: ''
            - space_dedupe
            - trim
            - object_field_set:
                object: product
                type: float
                field: price
        - find:
            path: meta[itemprop="currency"]
            do:
            - parse:
                attr: content
            - object_field_set:
                object: product
                field: currency
        - object_save:
            name: product
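
The price step near the end of the scenario (filter the digits, drop thousands separators, trim, store as a float) can be approximated outside Diggernaut as well. The sketch below is only an illustration in Python, not Diggernaut's own implementation: the function name `parse_price` is hypothetical, and it assumes that the `filter` option keeps the first captured group of the regular expression.

```python
import re
from typing import Optional

# Same character class as the scenario's filter: digits, dots, and commas.
PRICE_RE = re.compile(r"([0-9.,]+)")


def parse_price(raw: str) -> Optional[float]:
    """Approximate the scenario's price steps; return None if nothing usable is found."""
    match = PRICE_RE.search(raw)
    if match is None:
        return None
    # Drop thousands separators, as the replace_substring step does.
    cleaned = match.group(1).replace(",", "")
    # Collapse and trim whitespace (space_dedupe + trim in the scenario).
    cleaned = " ".join(cleaned.split())
    try:
        return float(cleaned)
    except ValueError:
        # For example, a lone "." matches the filter but is not a number.
        return None


print(parse_price("1,225.99"))
print(parse_price("$89.95"))
print(parse_price("n/a"))
```

Returning `None` for unparseable input is a design choice of this sketch; how the digger itself treats such values is not specified here.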

Below is a sample dataset with several products, shown in JSON for readability. The dataset can also be downloaded as CSV, XLSX, XML, or any other text format by using templates.

[{
    "product": {
        "brand": "Ben Meadows",
        "category": "Forestry Supplies and Equipment|Logging and Clearing Tools|Cable Pullers and Log Chains",
        "currency": "USD",
        "date": "2017-12-07T01:35:16.184Z",
        "description": "The compact design of this Swaged Wire Rope makes it up to 26% stronger than standard winch lines of the same diameter. The outer wires have a larger surface area than standard winch lines, providing better resistance to wear and tear. The already compact line is also resistant to abrasion, pig tailing, kinking and drum crushing. The 6 x 26 IWRC construction features a stainless steel duplex sleeve that maximizes sleeve to wire rope contact and a strong alloy Hook and Latch. Design Factor: 3.55:1 Ratio. NOTE: Match to your existing wire rope size or check your winch manufacturer's wire rope size recommendation before ordering.",
        "images": "https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_03.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_03.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_03.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_03.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_03.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MW89_03.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/36MX08_AS01.jpg",
        "name": "B/A Products Swaged IWRC Wire Rope with Hook and Latch",
        "price": 89.95,
        "sku": "37346391",
        "url": "https://www.benmeadows.com/ba-products-swaged-iwrc-wire-rope-with-hook-and-latch_37346391/",
        "variations": "Diameter: 3/8''|Length: 100'|Diameter: 3/8''|Length: 75'|Diameter: 3/8''|Length: 56'|Diameter: 3/8''|Length: 35'|Diameter: 3/8''|Length: 50'|Diameter: 1/2''|Length: 100'|Diameter: 7/16''|Length: 150'|Diameter: 7/16''|Length: 75'|Diameter: 3/8''|Length: 150'|Diameter: 1/2''|Length: 150'|Diameter: 7/16''|Length: 50'"
    }
}
,{
    "product": {
        "brand": "Ben Meadows",
        "category": "Forestry Supplies and Equipment|Logging and Clearing Tools|Cable Pullers and Log Chains",
        "currency": "USD",
        "date": "2017-12-07T01:35:19.635Z",
        "description": "Tough and durable, these lifting/pulling devices are designed for heavy-duty jobs. Puller handle bends as a safety warning when it is overloaded. The frame and pawls are made of ductile iron and the yoke is malleable iron. The frame hook and end hook are forged steel and tackle block hook is a steel casting. Each Puller comes with 5⁄16\" wire cable on a welded ductile iron reel (cast iron reel on No. 210016). Dimensions: 8\"H x 6\"W x 17\"L.Heavy-duty model Pullers are available with two or three-ton capacity with double line. Two-ton model comes with choice of cable length.",
        "images": "https://www.benmeadows.com/images/ir/s7product/8C839_AS04.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS04.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS07.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS07.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AA01.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS06.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS06.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS05.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS07.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS07.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS07.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_AS07.jpg|https://www.benmeadows.com/images/ir/s7product/8C839_web01.jpg",
        "name": "The More Power Puller® Puller",
        "price": 225.99,
        "sku": "36810185",
        "url": "https://www.benmeadows.com/the-more-power-puller-puller_36810185/",
        "variations": "Capacity: 2 Ton|Length: 35'|Capacity: 3 Ton|Length: 20'|Capacity: 2 Ton|Length: 30'|Capacity: 2 Ton|Length: 20'"
    }
}
,{
    "product": {
        "brand": "Ben Meadows",
        "category": "Forestry Supplies and Equipment|Logging and Clearing Tools|Cable Pullers and Log Chains",
        "currency": "USD",
        "date": "2017-12-07T01:35:22.272Z",
        "description": "Make hauling and securing heavy loads a little easier with these Cable Pullers. Galvanized aircraft-quality cable is virtually indestructible.One-piece aluminum-alloy ratchet wheels resist wear and last longer than laminated wheels. Electro-plated parts protect against rust. Drop-forged steel slip hooks rotate a full 360°. Notch-at-a-time letdown makes for trouble-free, positive-control lowering and releasing.1-Ton Cable Puller is the \"original.\" 3⁄16\"-dia. cable. 15:1 leverage. 12' max. lift.2-Ton Cable Puller adds more leverage (30:1) and a pulley for heavier loads. 3⁄16\"-dia. cable. 6' max. lift.3-Ton Cable Puller has 5⁄16\"-dia. cable with 35:1 leverage for lifting your heaviest loads. 12' max. lift.",
        "images": "https://www.benmeadows.com/images/ir/s7product/8CJJ7_AS03.jpg|https://www.benmeadows.com/images/ir/s7product/8CJJ7_AS03.jpg|https://www.benmeadows.com/images/ir/s7product/8YAA0_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8YAA0_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8CJJ7_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8CJJ7_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8CJJ8_web01.jpg|https://www.benmeadows.com/images/ir/s7product/8CJJ8_web01.jpg",
        "name": "MAASDAM POW'R-PULL® Cable Pullers",
        "price": 40.99,
        "sku": "36810155",
        "url": "https://www.benmeadows.com/maasdam-powr-pull-cable-pullers_36810155/",
        "variations": "Capacity: 2 Ton|Capacity: 1 Ton|Capacity: 3 Ton"
    }
}
,{
    "product": {
        "brand": "Ben Meadows",
        "category": "Forestry Supplies and Equipment|Logging and Clearing Tools|Cable Pullers and Log Chains",
        "currency": "USD",
        "date": "2017-12-07T01:35:24.906Z",
        "description": "Class A, Grade 1 castings are completely self-contained. Handyman Jack allows you to lift load on down stroke of handle. Safety shear pin protects load from being dropped. Steel is standard rolled for strength and rigidity; reversible for extra wear. 4\"L lifting nose allows pickup as close as 4-1/2\" from bottom of 28 sq. in. base plate. Adjustable for clamping use. 4660-lb. capacity. Accessories are also available.Loc-Rac® is a mounting and locking device that transports your jack securely in a pickup truck or utility vehicle. Includes lock and keys.Bumper Lift is designed to fit most vehicle bumpers.",
        "images": "https://www.benmeadows.com/images/ir/s7product/8C978_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/8C978_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/8C978_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/8C978_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/8C978_AS01.jpg|https://www.benmeadows.com/images/ir/s7product/8C978_AS01.jpg",
        "name": "Handyman® Jacks",
        "price": 86.99,
        "sku": "36810133",
        "url": "https://www.benmeadows.com/handyman-jacks_36810133/",
        "variations": "Height: 60''|Height: 48''"
    }
}]
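
In the sample above, multi-valued fields are joined with "|", and the images field repeats the same URL many times. If you prefer lists and unique image URLs, a small post-processing step on the downloaded JSON is one option. This is a minimal sketch in Python, not part of Diggernaut: the helper names `split_field` and `clean_record` are hypothetical, and the inline record is a shortened, illustrative stand-in for a real download. Variations are split but not deduplicated, since repeated values there may belong to different option combinations.

```python
import json
from typing import Any, Dict, List


def split_field(value: str, sep: str = "|", unique: bool = False) -> List[str]:
    """Split a joined field; optionally drop repeats, keeping first-seen order."""
    parts = [part for part in value.split(sep) if part]
    return list(dict.fromkeys(parts)) if unique else parts


def clean_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one dataset record with joined fields turned into lists."""
    product = dict(record["product"])
    if isinstance(product.get("images"), str):
        product["images"] = split_field(product["images"], unique=True)
    for field in ("category", "variations"):
        if isinstance(product.get(field), str):
            product[field] = split_field(product[field])
    return {"product": product}


# Shortened, illustrative record in the same shape as the sample dataset.
raw = """[{"product": {
    "name": "Handyman Jacks",
    "category": "Forestry Supplies and Equipment|Logging and Clearing Tools",
    "images": "8C978_AS01.jpg|8C978_AS01.jpg|8C978_AS02.jpg",
    "variations": "Height: 60''|Height: 48''"
}}]"""

cleaned = [clean_record(record) for record in json.loads(raw)]
print(json.dumps(cleaned, indent=2))
```

For a real dataset, replace the inline string with the contents of the file you downloaded from your digger.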