Mikhail Sisin, co-founder of Diggernaut, a cloud service for data collection and web scraping. He has worked in data collection and analysis, and in the development of artificial intelligence and machine learning systems, for more than ten years.

Product scraper for the Anthropologie.com online store

Anthropologie is an American clothing retailer. The company currently operates more than 200 stores worldwide and offers a carefully curated assortment of clothing, jewelry, lingerie, home goods and decor, beauty products, and gifts. In August 1992, Richard Hayne came up with the idea of opening a clothing store for creative, educated women aged 30 to 45, and that is how Anthropologie began. This product scraper is designed to collect information about the products listed on the store's website, anthropologie.com.

Approximate number of products: 50000
Approximate number of requests: 50000
Recommended subscription plan: Small

ATTENTION! The number of requests may exceed the number of products, because data about variations, images, and so on may be scraped with requests to additional resources. Part of the product data may also be delivered through XHR requests, which further increases the total number of requests required.

To use it, you need an account with our Diggernaut service.

  1. Register for an account with the Diggernaut service.
  2. After registering and confirming your email address, log in to your account.
  3. Create a project with any name and description. If you are not sure how, see our documentation.
  4. Open the newly created project and create a digger in it with any name. If you are not sure how, see our documentation.
  5. Copy the digger script below to the clipboard and paste it into the digger you created. If you are not sure how, see our documentation.
  6. Switch the digger's mode from Debug to Active. If you are not sure how, see our documentation.
  7. Run your digger and wait for it to finish. If you are not sure how, see our documentation.
  8. Download the collected dataset in the format you need. If you are not sure how, see our documentation.

Later on, you can set a schedule for your scraper and collect the data on a regular basis.

Scraper script:

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.anthropologie.com
    do:
    - find: 
        path: .c-main-navigation__li--level-1 
        do: 
        - find: 
            path: span
            slice: 0
            do: 
            - parse
            - space_dedupe
            - trim
            - normalize:
                routine: lower
            - variable_set: cat1
        - find: 
            path: .c-main-navigation__li--level-2 
            do: 
            - variable_clear: subcat
            - find: 
                path: .c-main-navigation__a--level-2
                do: 
                - parse
                - space_dedupe
                - trim
                - normalize:
                    routine: lower
                - variable_set: cat2
            - find: 
                path: .c-main-navigation__li--level-3 a 
                do: 
                - parse
                - space_dedupe
                - trim
                - normalize:
                    routine: lower
                - variable_set: cat3
                - variable_set:
                    field: subcat
                    value: 1
                - parse:
                    attr: href
                - pool_clear: main
                - link_add:
                    pool: main
                - walk:
                    to: links
                    pool: main
                    do:
                    - find: 
                        path: .js-pagination__arrow--next
                        slice: 0
                        do: 
                        - parse:
                            attr: href
                        - link_add:
                            pool: main
                    - find: 
                        path: .c-product-tile__image-link 
                        do: 
                        - parse:
                            attr: href
                            filter:
                                - (.+)\?
                                - (.+)
                        - normalize:
                            routine: url
                        - walk:
                            to: value
                            do:
                            - find: 
                                path: body
                                do: 
                                - object_new: product
                                - eval:
                                    routine: js
                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                - object_field_set:
                                    object: product
                                    field: date
                                - register_set: Anthropologie
                                - object_field_set:
                                    object: product
                                    field: brand
                                - static_get: url
                                - object_field_set:
                                    object: product
                                    field: url
                                - find: 
                                    path: meta[property="product:price:amount"] 
                                    do: 
                                    - parse:
                                        attr: content
                                    - if:
                                        match: (\d)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: price
                                            type: float
                                        - register_set: USD
                                        - object_field_set:
                                            object: product
                                            field: currency
                                - find: 
                                    path: .o-carousel__flex-wrapper > img.c-product-image 
                                    do: 
                                    - parse:
                                        attr: src
                                        filter:
                                            - (.+)\?
                                            - (.+)
                                    - normalize:
                                        routine: url
                                    - object_field_set:
                                        object: product
                                        field: images
                                        joinby: "|"
                                    
                                - find: 
                                    path: script:matches(window\.productData) 
                                    do: 
                                    - parse:
                                        filter:
                                            - window.productData\s*=\s*\'\s*(.+)\s*\'\s*;
                                    - normalize:
                                        routine: Base64ZLIBDecode
                                    - normalize:
                                        routine: json2xml
                                    - to_block
                                    - find: 
                                        path: body_safe 
                                        do: 
                                        - find: 
                                            path: primaryslice:hasChild(displaylabel:matches(Color)) 
                                            do: 
                                            - find: 
                                                path: sliceitems > displayname
                                                do: 
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: variations
                                                    joinby: "|"
                                            - find: 
                                                path: sliceitems
                                                do: 
                                                - variable_clear: iid
                                                
                                                - find: 
                                                    path:  id
                                                    slice: 0
                                                    do: 
                                                    - parse
                                                    - variable_set: iid
                                                - find: 
                                                    path: images
                                                    do: 
                                                    - parse
                                                    - register_set: http://images.anthropologie.com/is/image/Anthropologie/<%iid%>_<%register%>
                                                    - object_field_set:
                                                        object: product
                                                        field: images
                                                        joinby: "|"
                                                    
                                                
                                        - find: 
                                            path: product > stylenumber
                                            slice: 0
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: sku
                                        - find: 
                                            path: product > product > brand
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: brand
                                        - find: 
                                            path: product > product > displayname
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: name
                                        - find: 
                                            path: product > product > longdescription 
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: description
                                - variable_get: cat1
                                - if:
                                    match: (\S)
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: category
                                        joinby: "|"
                                - variable_get: cat2
                                - if:
                                    match: (\S)
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: category
                                        joinby: "|"
                                - variable_get: cat3
                                - if:
                                    match: (\S)
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: category
                                        joinby: "|"
                                - object_save:
                                    name: product
            - variable_get: subcat
            - if:
                match: (1)
                else:
                - find: 
                    path: .c-main-navigation__a--level-2
                    do: 
                    - parse:
                        attr: href
                    - pool_clear: main
                    - link_add:
                        pool: main
                    - walk:
                        to: links
                        pool: main
                        do:
                        - find: 
                            path: .js-pagination__arrow--next
                            slice: 0
                            do: 
                            - parse:
                                attr: href
                            - link_add:
                                pool: main
                        - find: 
                            path: .c-product-tile__image-link 
                            do: 
                            - parse:
                                attr: href
                                filter:
                                    - (.+)\?
                                    - (.+)
                            - normalize:
                                routine: url
                            - walk:
                                to: value
                                do:
                                - find: 
                                    path: body
                                    do: 
                                    - object_new: product
                                    - eval:
                                        routine: js
                                        body: '(function (){var d = new Date(); return d.toISOString()})();'
                                    - object_field_set:
                                        object: product
                                        field: date
                                    - register_set: Anthropologie
                                    - object_field_set:
                                        object: product
                                        field: brand
                                    - static_get: url
                                    - object_field_set:
                                        object: product
                                        field: url
                                    - find:
                                        path: meta[property="product:price:amount"]
                                        do:
                                        - parse:
                                            attr: content
                                        - if:
                                            match: (\d)
                                            do:
                                            - object_field_set:
                                                object: product
                                                field: price
                                                type: float
                                            - register_set: USD
                                            - object_field_set:
                                                object: product
                                                field: currency
                                    - find:
                                        path: .o-carousel__flex-wrapper > img.c-product-image
                                        do:
                                        - parse:
                                            attr: src
                                            filter:
                                                - (.+)\?
                                                - (.+)
                                        - normalize:
                                            routine: url
                                        - object_field_set:
                                            object: product
                                            field: images
                                            joinby: "|"

                                    - find:
                                        path: script:matches(window\.productData)
                                        do:
                                        - parse:
                                            filter:
                                                - window.productData\s*=\s*\'\s*(.+)\s*\'\s*;
                                        - normalize:
                                            routine: Base64ZLIBDecode
                                        - normalize:
                                            routine: json2xml
                                        - to_block
                                        - find:
                                            path: body_safe
                                            do:
                                            - find:
                                                path: primaryslice:hasChild(displaylabel:matches(Color))
                                                do:
                                                - find:
                                                    path: sliceitems > displayname
                                                    do:
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: variations
                                                        joinby: "|"
                                                - find:
                                                    path: sliceitems
                                                    do:
                                                    - variable_clear: iid

                                                    - find:
                                                        path: id
                                                        slice: 0
                                                        do:
                                                        - parse
                                                        - variable_set: iid
                                                    - find:
                                                        path: images
                                                        do:
                                                        - parse
                                                        - register_set: http://images.anthropologie.com/is/image/Anthropologie/<%iid%>_<%register%>
                                                        - object_field_set:
                                                            object: product
                                                            field: images
                                                            joinby: "|"


                                            - find:
                                                path: product > stylenumber
                                                slice: 0
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: sku
                                            - find:
                                                path: product > product > brand
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                            - find:
                                                path: product > product > displayname
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: name
                                            - find:
                                                path: product > product > longdescription
                                                do:
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: description
                                    - variable_get: cat1
                                    - if:
                                        match: (\S)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: category
                                            joinby: "|"
                                    - variable_get: cat2
                                    - if:
                                        match: (\S)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: category
                                            joinby: "|"
                                    - variable_get: cat3
                                    - if:
                                        match: (\S)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: category
                                            joinby: "|"
                                    - object_save:
                                        name: product
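
For readers who want to see what happens to the window.productData payload outside the platform, here is a minimal Python sketch of the decoding step. It assumes that the Base64ZLIBDecode routine means "Base64-decode, then inflate a standard zlib stream", which is inferred from the routine's name only. The function names and the keys of the sample payload are hypothetical and serve purely as a demonstration.

```python
import base64
import json
import zlib


def decode_product_data(encoded: str) -> dict:
    """Base64-decode, zlib-inflate, and JSON-parse an embedded payload.

    Assumption: the payload is a standard zlib stream wrapped in Base64.
    """
    compressed = base64.b64decode(encoded)
    raw = zlib.decompress(compressed)
    return json.loads(raw.decode("utf-8"))


def encode_product_data(data: dict) -> str:
    """Inverse helper, used here only to build a sample payload."""
    raw = json.dumps(data).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


# Hypothetical payload; the real key names on the site may differ.
sample = {
    "product": {
        "styleNumber": "44448363",
        "displayName": "Anatomy of a Fragrance Gift Set",
    }
}

payload = encode_product_data(sample)
decoded = decode_product_data(payload)
print(decoded["product"]["displayName"])
```

In the script itself, the decoded JSON is then converted to XML by json2xml so that the usual find commands can walk it.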

Below is a sample dataset with several products, shown in JSON format for readability. The dataset can also be downloaded as CSV, XLSX, XML, or any other text format by using templates.

[{
    "product": {
        "brand": "Illume",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:15:58.241Z",
        "description": "New from the fragrance masters at Illume, Anatomy of a Fragrance bath and beauty products are sophisticated, lighthearted luxuries. Each is crafted in Minnesota, where Illume combines their signature scents with beautiful packaging designed in-house. From lavish hand creams to triple-milled soaps to nature-inspired perfumes, their line is ready-made for gifting and indulging. **Honey Rose**: a warm, romantic scent with notes of lily of the valley, sandalwood and bergamot **Orchid Vanille**: a bright, fresh combination of orange blossom, jasmine, black currant and praline **Wildflower Bergamot**: A zesty blend of bergamot, lemon and mango layered with cedar and sandalwood",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/44448363_040_b|http://images.anthropologie.com/is/image/Anthropologie/44448363_040_b|http://images.anthropologie.com/is/image/Anthropologie/44448363_070_b|http://images.anthropologie.com/is/image/Anthropologie/44448363_065_b",
        "name": "Anatomy of a Fragrance Gift Set",
        "sku": "44448363",
        "url": "https://www.anthropologie.com/shop/anatomy-of-a-fragrance-gift-set",
        "variations": "Wildflower Bergamot|Orchid Vanille|Honey Rose"
    }
}
,{
    "product": {
        "brand": "Capri Blue",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:15:59.713Z",
        "description": "Capri Blue's iconic vessels and fragrances - proudly designed and poured in Mississippi - are a long-standing favorite at Anthropologie. The line pairs striking visuals with intoxicating scents to create beautifully aromatic products like soy-blended candles and vegan-formulated beauty care. **Volcano**: tropical fruits, sugared oranges, lemons and limes, redolent with lightly exotic mountain greens **Coastal**: notes of pineapple, verbena and coconut, accented by sparkling lemon, bergamot and grapefruit **Fir & Firewood**: a fruity, green aroma of apple, clove, fir, pine needle, white birch, cedar, vetiver and musk **Japanese Quince & Cedar**: aromatic cedar wood is embellished with sun-ripened cassis, sugared quince, accents of red currant and a splash of sparkling pomelo **Gardenia & Fig**: bright greens and fresh peach mingle with gardenia, rose, ylang ylang and coconut over a base of light musk **Cinnamon Toddy**: a mouthwatering medley of ripe apple, warm cinnamon, golden clove and grated nutmeg topped with notes of honey and maple **Spiced Cider**: nutmeg, clove and cinnamon are layered over fresh apple and juicy orange notes **Lagoon**: top notes of freesia, incense and tamarind blend over a musky base of cashmere, wood and vetiver **Grapefruit Neroli**: sun-kissed grapefruit, quince and tangerine over neroli, vanilla, orchid and currant",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/19851559_033_b|https://images.anthropologie.com/is/image/Anthropologie/19851559_033_b10|http://images.anthropologie.com/is/image/Anthropologie/19851559_033_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_033_b10|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b10|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b15|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b16|http://images.anthropologie.com/is/image/Anthropologie/19851559_049_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_026_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_098_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_040_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_007_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_007_b2",
        "name": "Capri Blue Iridescent Jar Candle",
        "sku": "19851559",
        "url": "https://www.anthropologie.com/shop/capri-blue-iridescent-jar-candle8",
        "variations": "Fir and Firewood|Spiced Cider|Volcano|Spiced Cider|Fir and Firewood|Volcano|Volcano"
    }
}
,{
    "product": {
        "brand": "Anthropologie",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:16:00.340Z",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b3|https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b|https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b2|https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_010_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_010_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_010_b15|http://images.anthropologie.com/is/image/Anthropologie/39336862_030_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_030_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_030_b15|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_065_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_065_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_065_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_051_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_051_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_051_b10|http://images.anthropologie.com/is/image/Anthropologie/39336862_066_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_066_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_066_b10",
        "name": "Slivered Geode Coaster",
        "sku": "39336862",
        "url": "https://www.anthropologie.com/shop/geode-coaster",
        "variations": "Black Quartz|Dyed Citron|White Quartz|Adventurian|Dyed Blue|Dyed Magenta|Amethyst|Rose quartz"
    }
}
,{
    "product": {
        "brand": "Floreat",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:16:01.211Z",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b|https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b2|https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b3|https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b4|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b2|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b3|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b4|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b2|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b3|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b4",
        "name": "Floreat Printed Sleep Pants",
        "sku": "43663541",
        "url": "https://www.anthropologie.com/shop/floreat-printed-sleep-pants",
        "variations": "ASSORTED|BLUE MOTIF"
    }
}]
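
The category, images, and variations fields arrive as single strings joined with "|", and the sample shows the same image listed under both http and https. If you prefer lists without repeats, a small post-processing step along these lines may help. This is only a sketch: the helper names are hypothetical, and rewriting http to https assumes the image host serves both schemes identically, which the sample suggests but does not guarantee. Skip the deduplication for variations if repeated names are meaningful in your use case.

```python
import json


def split_unique(joined, sep="|"):
    """Split a joined field, dropping empty and repeated entries in order."""
    seen = set()
    result = []
    for item in joined.split(sep):
        item = item.strip()
        if item and item not in seen:
            seen.add(item)
            result.append(item)
    return result


def force_https(url):
    """Rewrite an http:// URL to https:// so both schemes compare equal."""
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url


def clean_record(record):
    """Return a copy of one dataset record with joined fields expanded."""
    product = dict(record["product"])
    for field in ("category", "variations"):
        if field in product:
            product[field] = split_unique(product[field])
    if "images" in product:
        urls = "|".join(force_https(u) for u in product["images"].split("|"))
        product["images"] = split_unique(urls)
    return {"product": product}


# A shortened record in the same shape as the sample dataset above.
raw = json.dumps([{
    "product": {
        "category": "gifts|features|the gift guide",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/44448363_040_b"
                  "|http://images.anthropologie.com/is/image/Anthropologie/44448363_040_b"
                  "|http://images.anthropologie.com/is/image/Anthropologie/44448363_070_b",
        "variations": "Volcano|Spiced Cider|Volcano",
    }
}])

cleaned = [clean_record(r) for r in json.loads(raw)]
for url in cleaned[0]["product"]["images"]:
    print(url)
```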