
I encountered the same problem a few years ago and indeed realized that using categories to understand what type of article a thing was (person? subject? event?) was utterly useless, for the reasons you describe.

On the other hand, I discovered that infoboxes (the data in the top-right box on most pages) were generally extremely reliable, if frustrating to parse.
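To give a feel for why the parsing is frustrating, here is a minimal Python sketch that pulls `| key = value` lines out of the first infobox template in a page's wikitext. The function name and the sample text are made up for illustration. It assumes one field per line and ignores multi-line values, triple-brace parameters and fields that belong to nested templates, which is exactly where real pages get painful.

```python
import re

# Made-up sample wikitext, for illustration only.
SAMPLE = """{{Infobox officeholder
| name = Barack Obama
| Vice President = Joe Biden
}}
Article text with a stray | pipe = sign."""


def parse_infobox_fields(wikitext: str) -> dict:
    """Naively collect `| key = value` lines from the first {{Infobox ...}}."""
    opening = re.search(r"\{\{\s*Infobox", wikitext, re.IGNORECASE)
    if opening is None:
        return {}

    # Walk the braces to find where the infobox template closes,
    # so that nested {{...}} templates do not end it early.
    depth = 0
    i = opening.start()
    end = len(wikitext)
    while i < len(wikitext):
        if wikitext.startswith("{{", i):
            depth += 1
            i += 2
        elif wikitext.startswith("}}", i):
            depth -= 1
            i += 2
            if depth == 0:
                end = i
                break
        else:
            i += 1

    fields = {}
    for line in wikitext[opening.start():end].splitlines():
        match = re.match(r"\s*\|\s*([^=|]+?)\s*=\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields
```

Calling `parse_infobox_fields(SAMPLE)` returns a plain dict of field names to raw values. The values are still wikitext, so links and nested templates would need another pass.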




The infoboxes are created from a query to Wikidata, which you can query yourself! No scraping necessary! https://query.wikidata.org/

You'll want to learn SPARQL, but if you know SQL it's not so bad to pick up.
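For anyone who wants to try it from a script rather than the web UI, a rough Python sketch follows. It assumes the query service's public SPARQL endpoint at https://query.wikidata.org/sparql with a `format=json` parameter. The function names and the User-Agent string are placeholders of mine, and the query simply lists some raw statements for item Q76.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# List up to 20 raw (property, value) pairs attached to item Q76.
EXAMPLE_QUERY = """
SELECT ?property ?value WHERE {
  wd:Q76 ?property ?value .
}
LIMIT 20
"""


def build_query_url(sparql: str) -> str:
    """Encode a SPARQL query into a GET URL that asks for JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


def run_query(sparql: str) -> list:
    """Send the query over the network and return the result bindings."""
    request = urllib.request.Request(
        build_query_url(sparql),
        # Placeholder User-Agent; replace it with something identifying you.
        headers={"User-Agent": "example-script/0.1 (hypothetical contact)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        data = json.load(response)
    return data["results"]["bindings"]
```

`run_query(EXAMPLE_QUERY)` needs network access; each binding is a dict keyed by the variable names in the SELECT clause.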


As far as I can tell, that is not the case, sadly.

Right now it appears that only 3,975 articles have infoboxes auto-generated from Wikidata. [1] The wikitext contains something like "{{Wikidata Infobox ...}}" instead of just "{{Infobox ...}}".
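If you are parsing wikitext yourself, a small Python check along these lines can tell the two cases apart. It is only a sketch that goes by the template names quoted above ("Wikidata Infobox" versus "Infobox"). The function name is mine, and real pages may use other naming patterns that it would miss.

```python
import re

# Capture the template name that follows each "{{" (up to "|", "}" or newline).
_TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}\n]+)")


def infobox_kind(wikitext: str) -> str:
    """Return "wikidata", "hand-edited" or "none" for the first infobox found."""
    for match in _TEMPLATE_NAME.finditer(wikitext):
        name = match.group(1).strip().lower()
        if name.startswith("wikidata infobox"):
            return "wikidata"
        if name.startswith("infobox"):
            return "hand-edited"
    return "none"
```

Pages classified as "wikidata" carry no field data in the wikitext, so their values would have to be fetched from Wikidata instead.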

If you look up a popular article like Barack Obama [2], it's just a traditional hand-edited infobox. In fact, one of the first lines of data says "Vice President = Joe Biden", while the Wikidata entry for Barack Obama [3] doesn't reference Biden anywhere -- so not only is the Wikipedia infobox not generated from Wikidata, but Wikidata isn't pulling all the relevant info from Wikipedia either.

Back when I was working on my project, I'd hoped Wikidata could be a solution, but it was far too incomplete and its information was regularly out of date. Perhaps (hopefully) it's better now, but it's clearly not being used to power infoboxes yet except in a tiny number of cases. (Which actually complicates things further, since anybody parsing Wikipedia infoboxes now has to deal separately with the 3,975 that pull from Wikidata, because none of the actual data is copied into the wikitext...)

[1] https://en.wikipedia.org/wiki/Category:Articles_with_infobox...

[2] https://en.wikipedia.org/wiki/Barack_Obama

[3] https://www.wikidata.org/wiki/Q76


For sure, thank you for the correction. I was under the impression its role was broader.


Wikidata is not a solution at all.

I recently ran into the same kind of problem in Wikidata.

https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Onto...

A typical problem is of the "light rail (Q1268865) is data visualization (Q6504956)" kind. This specific one is fixed, but there are many similar ones.
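One way to hunt for chains like this, assuming you have already exported the relevant parent links into a local mapping, is a plain breadth-first search from the item to the suspicious ancestor. The Python sketch below uses a toy mapping. Only the two Q-ids above are real, and the intermediate nodes are hypothetical placeholders, not actual Wikidata items.

```python
from collections import deque
from typing import Optional

# Toy parent links: item -> list of parent classes.
# "X1" and "X2" are hypothetical placeholders, not real Wikidata items.
TOY_PARENTS = {
    "Q1268865": ["X1"],
    "X1": ["X2"],
    "X2": ["Q6504956"],
}


def find_chain(parents: dict, start: str, target: str) -> Optional[list]:
    """Return the shortest chain of parent links from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for parent in parents.get(node, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(path + [parent])
    return None
```

The returned chain shows which intermediate link to inspect, which is usually where the bad statement lives.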

https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/...

https://www.wikidata.org/wiki/Wikidata:Project_chat#Ontology...



