Scraping text after a span with Regex (and Requests)
Posted By: Anonymous
I have an unformatted, messy bs4.BeautifulSoup element from a webpage. The soup looks like this:
soup = ' </span><span class="productConfiguration__shippingDateEnd">Jul 30, 2021</span>"},{"id":"50014999","description":null,"displayValue":"M","value":"M","selected":false,"selectable":true,"url":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Variation?dwvar_2947_pv_rahmenfarbe=YE%2FBK&dwvar_2947_pv_rahmengroesse=M&pid=2947&quantity=1","hasComingSoon":true,"hasAllComingSoonAttr":true,"configurationUrl":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Configure?pid=2947&dwvar_2947_pv_rahmengroesse=M&dwvar_2947_pv_rahmenfarbe=YE%2fBK","sizeMin":178,"sizeMax":184,"measurementInterval":"178 cm - 184 cm","comingSoonReason":"productOrPreferenceInstockDate","comingSoon":true,"availability":{"messages":["Back order"],"inStockDate":"2021-08-09T00:00:00.000Z","onlyXLeftNumber":122,"onlyXLeft":false,"lowStock":false,"shippingInfo":"Coming in August 2021","available":false,"availableSufficient":true,"notifyMe":true,"showOutOfStock":false,"similarBikes":false,"comingSoonByBackOrderAllocation":false},"hasSuccessorProduct":false,"comingSoonMessage":"Coming in August 2021"},
{"id":"50015000","description":null,"displayValue":"L","value":"L","selected":false,"selectable":true,"url":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Variation?dwvar_2947_pv_rahmenfarbe=YE%2FBK&dwvar_2947_pv_rahmengroesse=L&pid=2947&quantity=1","hasComingSoon":true,"hasAllComingSoonAttr":true,"configurationUrl":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Configure?pid=2947&dwvar_2947_pv_rahmengroesse=L&dwvar_2947_pv_rahmenfarbe=YE%2fBK","sizeMin":184,"sizeMax":190,"measurementInterval":"184 cm - 190 cm","comingSoonReason":"productOrPreferenceInstockDate","comingSoon":true,"availability":{"messages":["Back order"],"inStockDate":"2021-08-16T00:00:00.000Z","onlyXLeftNumber":96,"onlyXLeft":false,"lowStock":false,"shippingInfo":"Coming in August 2021","available":false,"availableSufficient":true,"notifyMe":true,"showOutOfStock":false,"similarBikes":false,"comingSoonByBackOrderAllocation":false},"hasSuccessorProduct":false,"comingSoonMessage":"Coming in August 2021"},
{"id":"50015001","description":null,"displayValue":"XL","value":"XL","selected":false,"selectable":true,"url":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Variation?dwvar_2947_pv_rahmenfarbe=YE%2FBK&dwvar_2947_pv_rahmengroesse=XL&pid=2947&quantity=1","hasComingSoon":true,"hasAllComingSoonAttr":true,"configurationUrl":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Configure?pid=2947&dwvar_2947_pv_rahmengroesse=XL&dwvar_2947_pv_rahmenfarbe=YE%2fBK","sizeMin":190,"sizeMax":196,"measurementInterval":"190 cm - 196 cm","comingSoonReason":"productOrPreferenceInstockDate","comingSoon":true,"availability":{"messages":["Back order"],"inStockDate":"2021-08-09T00:00:00.000Z","onlyXLeftNumber":38,"onlyXLeft":false,"lowStock":false,"shippingInfo":"Coming in August 2021","available":false,"availableSufficient":true,"notifyMe":true,"showOutOfStock":false,"similarBikes":false,"comingSoonByBackOrderAllocation":false},"hasSuccessorProduct":false,"comingSoonMessage":"Coming in August 2021"},
{"id":"50015002","description":null,"displayValue":"2XL","value":"2XL","selected":false,"selectable":true,"url":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Variation?dwvar_2947_pv_rahmenfarbe=YE%2FBK&dwvar_2947_pv_rahmengroesse=2XL&pid=2947&quantity=1","hasComingSoon":false,"hasAllComingSoonAttr":false,"configurationUrl":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Configure?pid=2947&dwvar_2947_pv_rahmengroesse=2XL&dwvar_2947_pv_rahmenfarbe=YE%2fBK","sizeMin":196,"sizeMax":999,"measurementInterval":"> 196 cm","comingSoonReason":"","comingSoon":false,"availability":{"messages":["Back order"],"inStockDate":"2021-07-26T00:00:00.000Z","onlyXLeftNumber":10,"onlyXLeft":false,"lowStock":false,"shippingInfo":"Shipping <span class="productConfiguration__shippingDate">Jul 26, 2021</span><span class="productConfiguration__shippingDateSeparator"> - </span><span class="productConfiguration__shippingDateEnd">Jul 30, 2021</span>","available":true,"availableSufficient":true,"notifyMe":false,"showOutOfStock":false,"similarBikes":false,"comingSoonByBackOrderAllocation":false},"hasSuccessorProduct":false,"comingSoonMessage":"Shipping <span class="productConfiguration__shippingDate">Jul 26, 2021</span><span class="productConfiguration__shippingDateSeparator"> - </span><span class="productConfiguration__shippingDateEnd">Jul 30, 2021</span>"}],"resetUrl":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Variation?dwvar_2947_pv_rahmenfarbe=YE%2FBK&dwvar_2947_pv_rahmengroesse=&pid=2947&quantity=1","hasSelectedValue":false,"isLastAttributeOnPDP":true,"colorAttribute":false,"sizeAttribute":true,"buttonAttribute":false,"damagedAttribute":false}]}};</script>'
I need the elements after the span with class="productConfiguration__shippingDateEnd", i.e. the "id" dictionary, so that I end up with something like this after the search:
{"id":"50015002","description":null,"displayValue":"2XL","value":"2XL","selected":false,"selectable":true,"url":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Variation?dwvar_2947_pv_rahmenfarbe=YE%2FBK&dwvar_2947_pv_rahmengroesse=2XL&pid=2947&quantity=1","hasComingSoon":false,"hasAllComingSoonAttr":false,"configurationUrl":"https://www.xzy.com/on/demandware.store/Sites-RoW-Site/en_DE/Product-Configure?pid=2947&dwvar_2947_pv_rahmengroesse=2XL&dwvar_2947_pv_rahmenfarbe=YE%2fBK","sizeMin":196,"sizeMax":999,"measurementInterval":"> 196 cm","comingSoonReason":"","comingSoon":false,"availability":{"messages":["Back order"],"inStockDate":"2021-07-26T00:00:00.000Z","onlyXLeftNumber":10,"onlyXLeft":false,"lowStock":false,"shippingInfo":"Shipping}'
If I do soup1.find_all('span', class_='productConfiguration__shippingDateEnd'), I only get the result shown below. Also, .next_siblings doesn't return anything.
[<span class="productConfiguration__shippingDateEnd">Jul 30, 2021</span>,
[<span class="productConfiguration__shippingDateEnd">Jul 30, 2021</span>,
Any ideas how I can go about this?
Thanks a lot for your help.
Solution
What I see looks slightly different from what you show, but it contains the stock info by size. You can use a regex to extract the string from the page source, then the json module to turn that string into a Python object.
import json
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.canyon.com/en-de/road-bikes/endurance-bikes/endurace/cf-sl/endurace-cf-sl-7-disc/2947.html?dwvar_2947_pv_rahmenfarbe=YE%2FBK')
# Capture the JSON assigned to window.deptsfra in the page source
s = re.search(r'window.deptsfra=(.*);', r.text).group(1)
# print(s)
data = json.loads(s)
print(data)
# Per-size entries (index 1 in the response seen when this was written)
pprint(data['productDetail']['variationAttributes'][1]['values'])
for i in data['productDetail']['variationAttributes'][1]['values']:
    print(i['value'], i['availability'])
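The same regex-then-json steps can be tried offline, without hitting the site. The following is only a sketch: sample_html is a made-up, heavily shortened stand-in for r.text, and the attribute ids in it are placeholders, so the real structure may differ. It also checks for a failed match instead of calling .group(1) directly.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for r.text (the real page is far larger).
sample_html = (
    '<script>window.deptsfra={"productDetail": {"variationAttributes": ['
    '{"id": "colour-placeholder", "values": []}, '
    '{"id": "size-placeholder", "values": ['
    '{"value": "M", "availability": {"shippingInfo": "Coming in August 2021"}}, '
    '{"value": "2XL", "availability": {"shippingInfo": "Shipping Jul 26, 2021"}}'
    ']}]}};</script>'
)

# Same pattern as in the answer: capture everything between '=' and the last ';'.
match = re.search(r'window.deptsfra=(.*);', sample_html)
if match is None:
    # The page layout may change, so fail with a clear message.
    raise ValueError('window.deptsfra assignment not found')

# Turn the captured string into nested dicts and lists.
data = json.loads(match.group(1))
sizes = data['productDetail']['variationAttributes'][1]['values']
for entry in sizes:
    print(entry['value'], entry['availability']['shippingInfo'])
```

If the site renames the variable or moves the data, the explicit None check makes the failure easier to diagnose than an AttributeError on .group.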
Values as shown in the table as a dict:
results = {i['value']: (bs(i['availability']['shippingInfo'], 'html.parser').get_text() if '<' in i['availability']['shippingInfo'] else i['availability']['shippingInfo']) for i in data['productDetail']['variationAttributes'][1]['values']}
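To get back to the entry asked about in the question (the one whose shippingInfo contains the productConfiguration__shippingDateEnd span), the parsed list can be filtered on that class name. This is a rough sketch, not the answer's own code: values is a made-up sample shaped like the entries above, and it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages. TextOnly and strip_tags are hypothetical helper names.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the fragment's text with all tags dropped."""
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.parts)


# Hypothetical sample entries shaped like the items in the 'values' list.
values = [
    {'id': '50014999', 'value': 'M',
     'availability': {'shippingInfo': 'Coming in August 2021'}},
    {'id': '50015002', 'value': '2XL',
     'availability': {'shippingInfo': (
         'Shipping <span class="productConfiguration__shippingDate">Jul 26, 2021</span>'
         '<span class="productConfiguration__shippingDateSeparator"> - </span>'
         '<span class="productConfiguration__shippingDateEnd">Jul 30, 2021</span>')}},
]

# Keep only the entries whose shipping text carries the end-date span.
marker = 'productConfiguration__shippingDateEnd'
with_dates = [v for v in values if marker in v['availability']['shippingInfo']]

# Plain-text shipping info per size.
results = {v['value']: strip_tags(v['availability']['shippingInfo']) for v in values}

print(with_dates[0]['id'])
print(results)
```

Filtering the parsed data this way avoids searching the raw soup for siblings of the span, which is what failed in the question.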
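Regex explanation: window.deptsfra= matches that text in the page source (the unescaped . matches any single character, including the literal dot), (.*) greedily captures the rest of that line into group 1, and the trailing ; makes the capture stop just before the last semicolon on the line. Since . does not match a newline by default, the match never spans more than one line.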
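How the pattern behaves is easiest to see on small inputs. The strings below are made up for illustration and are not taken from the real page; the lazy variant is shown only for comparison and is not what the answer uses.

```python
import re

pattern = r'window.deptsfra=(.*);'

# Greedy: on a single line with several semicolons, (.*) runs to the LAST one.
line = 'var a=1;window.deptsfra={"x": 1};var b=2;'
greedy = re.search(pattern, line).group(1)
print(greedy)

# Lazy variant (.*?) stops at the FIRST semicolon instead. That would cut the
# data short whenever the JSON itself contains a ';' inside a string.
lazy = re.search(r'window.deptsfra=(.*?);', line).group(1)
print(lazy)

# '.' does not match a newline by default, so the capture stays on one line.
multi = 'window.deptsfra={"x": 1};\nother();'
per_line = re.search(pattern, multi).group(1)
print(per_line)

# The unescaped '.' matches any character, not just a literal dot.
loose = re.search(pattern, 'windowXdeptsfra={};')
print(loose is not None)
```

The greedy form works for the page in question presumably because the assignment sits on its own line and ends with the final semicolon there. Escaping the dot as window\.deptsfra would make the match stricter.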
Answered By: Anonymous
Disclaimer: This content is shared under creative common license cc-by-sa 3.0. It is generated from StackExchange Website Network.