Loop and eliminate unwanted lines with Beautiful Soup
Posted By: Anonymous
I have an HTML file of a city's ways, from which I want to extract only the ways that are secondary, together with the tag lines that follow them (extract below):
<node id="8762697302" visible="true" version="1" changeset="105251293" timestamp="2021-05-24T21:31:46Z" user="4TL4S" uid="12781275" lat="19.5021226" lon="-99.1210088"/>
<node id="8762697303" visible="true" version="1" changeset="105251293" timestamp="2021-05-24T21:31:46Z" user="4TL4S" uid="12781275" lat="19.5021537" lon="-99.1210855"/>
<node id="8762697304" visible="true" version="1" changeset="105251293" timestamp="2021-05-24T21:31:46Z" user="4TL4S" uid="12781275" lat="19.5021738" lon="-99.1211046"/>
<node id="8762697305" visible="true" version="1" changeset="105251293" timestamp="2021-05-24T21:31:46Z" user="4TL4S" uid="12781275" lat="19.5022129" lon="-99.1211099"/>
<way id="24984236" visible="true" version="36" changeset="105251293" timestamp="2021-05-24T21:31:46Z" user="4TL4S" uid="12781275">
<nd ref="271534238"/>
<nd ref="271534237"/>
<nd ref="301605624"/>
<nd ref="8130722656"/>
<nd ref="271534236"/>
<nd ref="301605886"/>
<nd ref="8490482530"/>
<nd ref="271534235"/>
<nd ref="8130722659"/>
<nd ref="297808621"/>
<nd ref="5120247163"/>
<nd ref="8500986642"/>
<nd ref="8112567831"/>
<nd ref="8336910886"/>
<nd ref="8336910883"/>
<nd ref="8336910885"/>
<nd ref="8112567832"/>
<nd ref="8336910884"/>
<nd ref="8336910887"/>
<nd ref="271534230"/>
<nd ref="8112567834"/>
<nd ref="8762697298"/>
<nd ref="8112567833"/>
<nd ref="6348455382"/>
<tag k="highway" v="secondary"/>
<tag k="lanes" v="3"/>
<tag k="name" v="Avenida Acueducto de Guadalupe"/>
<tag k="oneway" v="yes"/>
<tag k="surface" v="asphalt"/>
</way>
<way id="24984237" visible="true" version="50" changeset="100730322" timestamp="2021-03-09T19:35:26Z" user="TheShiningAlbatross" uid="11724618">
<nd ref="1789642294"/>
<nd ref="298263634"/>
<nd ref="6348437061"/>
<nd ref="297274089"/>
<nd ref="8109075718"/>
<nd ref="297387276"/>
<nd ref="297274088"/>
<nd ref="8089031454"/>
<nd ref="271535272"/>
<nd ref="297387125"/>
<nd ref="271535273"/>
<nd ref="271535274"/>
<nd ref="8089403582"/>
<nd ref="5272807864"/>
<nd ref="271535275"/>
<nd ref="5272807871"/>
<nd ref="271535276"/>
<nd ref="8500972920"/>
<nd ref="8089235401"/>
<nd ref="8089235393"/>
<nd ref="297373675"/>
<tag k="highway" v="secondary"/>
<tag k="lanes" v="3"/>
<tag k="name" v="Avenida Instituto Politécnico Nacional"/>
<tag k="oneway" v="yes"/>
<tag k="surface" v="asphalt"/>
</way>
<way id="27093652" visible="true" version="5" changeset="100666370" timestamp="2021-03-09T00:06:55Z" user="TheShiningAlbatross" uid="11724618">
<nd ref="297274089"/>
<nd ref="8498394999"/>
<nd ref="8498394998"/>
<nd ref="297274090"/>
<nd ref="298256487"/>
<nd ref="299379524"/>
<nd ref="297274091"/>
<nd ref="297274088"/>
<tag k="highway" v="service"/>
<tag k="oneway" v="yes"/>
<tag k="surface" v="asphalt"/>
</way>
<way id="27093653" visible="true" version="24" changeset="100661225" timestamp="2021-03-08T20:45:38Z" user="TheShiningAlbatross" uid="11724618">
<nd ref="8089031455"/>
<nd ref="8227092924"/>
<nd ref="298270527"/>
<nd ref="8227092918"/>
<nd ref="297275667"/>
<nd ref="1905088915"/>
<nd ref="8089365647"/>
<nd ref="8227089401"/>
<nd ref="3779095087"/>
<nd ref="3779095094"/>
<nd ref="3779095086"/>
<nd ref="3779095093"/>
<nd ref="1792764124"/>
<nd ref="1792764110"/>
<nd ref="1792767134"/>
<nd ref="6174887577"/>
<nd ref="297274093"/>
<nd ref="1792795130"/>
<nd ref="8498057567"/>
<nd ref="297274094"/>
<nd ref="1792764140"/>
<nd ref="8088692607"/>
<nd ref="1792764135"/>
<nd ref="8490529604"/>
<nd ref="8490529603"/>
<nd ref="297274095"/>
<nd ref="1792764131"/>
<nd ref="268538192"/>
<tag k="highway" v="tertiary"/>
<tag k="lanes" v="2"/>
<tag k="name" v="Calzada Ticomán"/>
<tag k="oneway" v="no"/>
<tag k="surface" v="asphalt"/>
</way>
<way id="27093807" visible="true" version="22" changeset="95860337" timestamp="2020-12-15T08:51:08Z" user="Utsunomiya" uid="10074594">
<nd ref="8089031453"/>
<nd ref="6360545982"/>
<nd ref="297275687"/>
<nd ref="298281142"/>
<nd ref="298281139"/>
<nd ref="299381506"/>
<nd ref="6360545980"/>
<nd ref="297275694"/>
<nd ref="297275704"/>
<nd ref="6360545969"/>
<nd ref="297275707"/>
<nd ref="299381507"/>
<nd ref="1790748535"/>
<nd ref="297275708"/>
<nd ref="297275709"/>
<nd ref="1792449299"/>
<nd ref="1792449301"/>
<nd ref="8104327358"/>
<nd ref="8205206290"/>
<nd ref="299382462"/>
<nd ref="8205206222"/>
<nd ref="8205206221"/>
<nd ref="8230427925"/>
<nd ref="8089031453"/>
<tag k="addr:city" v="Ciudad de Mèxico"/>
<tag k="amenity" v="university"/>
<tag k="name" v="Centro de Investigación y de Estudios Avanzados CINVESTAV"/>
<tag k="operator" v="Instituto Politécnico Nacional"/>
<tag k="surface" v="asphalt"/>
</way>
<way id="27093966" visible="true" version="2" changeset="640886" timestamp="2008-09-15T18:30:34Z" user="yvasilev" uid="23179">
<nd ref="297277371"/>
<nd ref="297277373"/>
<nd ref="297277375"/>
<nd ref="297277377"/>
<nd ref="297277371"/>
<tag k="building" v="yes"/>
<tag k="created_by" v="Merkaartor 0.11"/>
<tag k="name" v="Patología Experimental y Fisiología"/>
</way>
<way id="27093967" visible="true" version="2" changeset="640886" timestamp="2008-09-15T18:30:35Z" user="yvasilev" uid="23179">
<nd ref="297277385"/>
<nd ref="297277388"/>
<nd ref="297277390"/>
<nd ref="297277392"/>
<nd ref="297277385"/>
<tag k="building" v="yes"/>
<tag k="created_by" v="Merkaartor 0.11"/>
<tag k="name" v="Fisiología"/>
</way>
<way id="27093969" visible="true" version="2" changeset="640886" timestamp="2008-09-15T18:30:36Z" user="yvasilev" uid="23179">
<nd ref="297277396"/>
<nd ref="297277398"/>
<nd ref="297277400"/>
<nd ref="297277405"/>
<nd ref="297277396"/>
<tag k="building" v="yes"/>
<tag k="created_by" v="Merkaartor 0.11"/>
<tag k="name" v="Bioquímica"/>
</way>
<way id="27093972" visible="true" version="2" changeset="640886" timestamp="2008-09-15T18:30:36Z" user="yvasilev" uid="23179">
<nd ref="297277414"/>
<nd ref="297277415"/>
<nd ref="297277416"/>
<nd ref="297277417"/>
<nd ref="297277414"/>
<tag k="building" v="yes"/>
<tag k="created_by" v="Merkaartor 0.11"/>
<tag k="name" v="Genética, Patología 1a sección"/>
</way>
So, when I do:
soup.find('way')
I get (I know .find only returns the first result):
<way changeset="105251293" id="24984236" timestamp="2021-05-24T21:31:46Z" uid="12781275" user="4TL4S" version="36" visible="true">
<nd ref="271534238"></nd>
<nd ref="271534237"></nd>
<nd ref="301605624"></nd>
<nd ref="8130722656"></nd>
<nd ref="271534236"></nd>
<nd ref="301605886"></nd>
<nd ref="8490482530"></nd>
<nd ref="271534235"></nd>
<nd ref="8130722659"></nd>
<nd ref="297808621"></nd>
<nd ref="5120247163"></nd>
<nd ref="8500986642"></nd>
<nd ref="8112567831"></nd>
<nd ref="8336910886"></nd>
<nd ref="8336910883"></nd>
<nd ref="8336910885"></nd>
<nd ref="8112567832"></nd>
<nd ref="8336910884"></nd>
<nd ref="8336910887"></nd>
<nd ref="271534230"></nd>
<nd ref="8112567834"></nd>
<nd ref="8762697298"></nd>
<nd ref="8112567833"></nd>
<nd ref="6348455382"></nd>
<tag k="highway" v="secondary"></tag>
<tag k="lanes" v="3"></tag>
<tag k="name" v="Avenida Acueducto de Guadalupe"></tag>
<tag k="oneway" v="yes"></tag>
<tag k="surface" v="asphalt"></tag>
</way>
From this result I would like to drop the preceding lines and keep only the contents of the tag lines, something like:
k="highway" v="secondary"
k="lanes" v="3"
k="name" v="Avenida Acueducto de Guadalupe"
k="oneway" v="yes"
k="surface" v="asphalt"
This is a very large file, so I need to loop through it and then turn the result into a table that I can process with pandas. I haven't figured out how to do this; please help.
Solution
Here is one way to create a pandas DataFrame from the HTML file (your_file.html contains the HTML from the question):
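import pandas as pd
from bs4 import BeautifulSoup

# Read the whole file and parse it with Python's built-in HTML parser.
with open("your_file.html", "r") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

data = []
for way in soup.select("way"):
    # One row per <way>: each <tag k="..." v="..."/> becomes a column.
    data.append({})
    for tag in way.select("tag"):
        data[-1][tag["k"]] = tag["v"]

# Ways that lack a given key get an empty string instead of NaN.
df = pd.DataFrame(data).fillna("")
print(df)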
Prints:
     highway  lanes                                              name  oneway  surface         addr:city     amenity                        operator  building       created_by
0  secondary      3                    Avenida Acueducto de Guadalupe     yes  asphalt
1  secondary      3            Avenida Instituto Politécnico Nacional     yes  asphalt
2    service                                                              yes  asphalt
3   tertiary      2                                   Calzada Ticomán      no  asphalt
4                    Centro de Investigación y de Estudios Avanzad...          asphalt  Ciudad de Mèxico  university  Instituto Politécnico Nacional
5                                 Patología Experimental y Fisiología                                                                                      yes  Merkaartor 0.11
6                                                          Fisiología                                                                                      yes  Merkaartor 0.11
7                                                          Bioquímica                                                                                      yes  Merkaartor 0.11
8                                      Genética, Patología 1a sección                                                                                      yes  Merkaartor 0.11
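The question also asks for only the secondary ways and mentions that the file is very large. Below is a minimal sketch of one way to handle both. It is not the Beautiful Soup approach used above: it swaps in the standard library's xml.etree.ElementTree.iterparse, which reads the file as a stream instead of loading it all at once. It assumes the file is well-formed XML with a single root element. The <osm> root in the sample data and the helper name secondary_ways are assumptions made for illustration and do not come from the question.

```python
import io
import xml.etree.ElementTree as ET


def secondary_ways(source):
    """Yield one dict of k/v pairs for every way tagged highway=secondary.

    `source` may be a file name or a binary file object.
    """
    # iterparse reports each element once its closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "way":
            continue
        tags = {t.get("k"): t.get("v") for t in elem.iter("tag")}
        if tags.get("highway") == "secondary":
            yield tags
        # Drop the children of the way we just handled to limit memory use.
        elem.clear()


# Tiny stand-in for the real file (assumed structure, hypothetical ids).
sample = b"""<osm>
<way id="1"><nd ref="10"/><tag k="highway" v="secondary"/><tag k="lanes" v="3"/></way>
<way id="2"><nd ref="11"/><tag k="highway" v="service"/></way>
</osm>"""

rows = list(secondary_ways(io.BytesIO(sample)))
for row in rows:
    for key, value in row.items():
        print(f'k="{key}" v="{value}"')
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as data in the code above. If you would rather stay with Beautiful Soup, the same filter can be written inside the loop over the ways as: if way.find("tag", attrs={"k": "highway", "v": "secondary"}):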
Answered By: Anonymous
Disclaimer: This content is shared under creative common license cc-by-sa 3.0. It is generated from StackExchange Website Network.