Scraping Wikipedia for coordinate information generates a rich, out-of-the-box data set that would otherwise take years to produce.
At Placemarkt, I’m working to make it easier to organize and discover places in your community. We’re sort of a pared-down throwback to Gowalla (if you’ve got a good memory) or Foursquare, where the focus is on creating and organizing locations, as well as routes, in a way that protects your personal information. Your data isn’t sold or used for anything at all. Instead, you pay an annual fee.
As the site has grown, I’ve been looking at ways to give users more information about their communities. We use OpenCage for forward and reverse geocoding and Thunderforest for our map tiles. As both are powered by OpenStreetMap data, we figured the open source world would be a great place to look in our search to build a richer, more textured map. Enter Wikipedia.
When the subject of a Wikipedia article is a geographic location that can be represented by geographic coordinates, that article will often display a map or, at the very least, lat/lng coordinates. In the article source, these coordinates are written with Wikipedia’s coordinate template.
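To make the template concrete, here is a minimal Python sketch that turns a coordinate template into decimal degrees. It assumes the two common forms of the {{coord}} template: plain decimal latitude and longitude, or degrees/minutes/seconds followed by a hemisphere letter. The function name parse_coord and the second sample string are made up for illustration, and this is not the C parser mentioned later in the post.

```python
# Hemisphere letters and the sign they apply to the decimal value.
HEMISPHERE_SIGN = {"N": 1, "S": -1, "E": 1, "W": -1}


def _to_decimal(numbers, sign):
    """Combine [deg], [deg, min] or [deg, min, sec] into signed decimal degrees."""
    return sign * sum(n / 60 ** i for i, n in enumerate(numbers))


def parse_coord(template):
    """Return (lat, lon) in decimal degrees for a {{coord|...}} template string.

    Assumption: only the decimal form and the hemisphere-letter forms are
    handled; anything else raises ValueError.
    """
    text = template.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template")
    fields = [f.strip() for f in text[2:-2].split("|")]
    if fields[0].lower() != "coord":
        raise ValueError("not a coord template")

    numbers, coords = [], []
    for field in fields[1:]:
        if field.upper() in HEMISPHERE_SIGN and numbers:
            # A hemisphere letter closes one coordinate.
            coords.append(_to_decimal(numbers, HEMISPHERE_SIGN[field.upper()]))
            numbers = []
        else:
            try:
                numbers.append(float(field))
            except ValueError:
                break  # first non-numeric field, e.g. display=title
    if not coords and len(numbers) == 2:
        coords = numbers  # plain decimal form: lat|lon
    if len(coords) != 2:
        raise ValueError("unsupported coord form")
    return coords[0], coords[1]


if __name__ == "__main__":
    print(parse_coord("{{coord|52.39405|-1.25831|display=title}}"))
    print(parse_coord("{{Coord|57|18|22|N|4|27|32|W|display=title}}"))
```

Real articles use more variants than this (nested templates, extra parameters), so a production parser would need more cases.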
Given Wikipedia’s 6 million plus articles, this represents a treasure trove of interesting information and, luckily, Wikipedia’s commitment to the open source community means that they make all of their data available for download.
Of course, it’s not quite as simple as just parsing a Wikipedia-provided CSV file with article id, title, and coordinate, because that doesn’t exist. Instead, we needed to parse the wiki dumps and generate such a file ourselves. To do so we did the following:
Grab the XML Wikipedia data dump and index files
The data dump file encompasses the entirety of English-language Wikipedia. The index file lists, for each article, the byte offset of the compressed block in the data dump that contains it.
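As a rough illustration of what the index gives you, here is a small Python sketch that reads index lines and works out the byte ranges of the compressed blocks. It assumes the index is plain text with one line per article in the form offset:page_id:title, which is my understanding of the multistream dump layout. The sample lines and the function names are made up for this sketch.

```python
def parse_index_line(line):
    """Split one index line into (byte_offset, page_id, title).

    Titles may themselves contain colons, so split at most twice.
    """
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def stream_ranges(index_lines):
    """Return (start, end) byte ranges, one per distinct offset in the index.

    Many articles share one offset because they live in the same compressed
    block. The end of the last block is not in the index, so it is None.
    """
    offsets = sorted({parse_index_line(l)[0] for l in index_lines if l.strip()})
    return list(zip(offsets, offsets[1:] + [None]))


# Illustrative index lines, not real dump values.
SAMPLE_INDEX = [
    "600:10:Example article",
    "600:12:Title: with a colon",
    "250000:900:Another example",
]

if __name__ == "__main__":
    print(stream_ranges(SAMPLE_INDEX))
```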
Use the index file to decompress portions of the XML file with dd
This allowed us to sequentially inspect the entirety of Wikipedia on a small server by using dd to cut the data dump file into smaller chunks that could be decompressed and processed one at a time.
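The same idea can be sketched in Python instead of dd: seek to a block's byte offset, read only that block, and decompress it on its own. This assumes the dump is the bz2 multistream file, where each block is an independent bz2 stream. The function name read_stream is hypothetical, and this is a stand-in for the bash script, not a copy of it.

```python
import bz2


def read_stream(dump_path, start, end=None):
    """Decompress the single bz2 stream that begins at byte `start`.

    `end` is the offset of the next stream; if it is None, everything up to
    the end of the file is read and only the first stream is decompressed.
    """
    with open(dump_path, "rb") as dump:
        dump.seek(start)
        compressed = dump.read() if end is None else dump.read(end - start)
    # BZ2Decompressor stops at the end of the first stream it sees.
    return bz2.BZ2Decompressor().decompress(compressed).decode("utf-8")
```

Because only one block is held in memory at a time, this keeps the footprint small in the same way the dd approach does.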
Parse the output for coordinates using a combination of grep, xmlstarlet, and C
After grabbing each article id, we used a regex to pull out the coordinate template, which we then parsed with a small C program. Here’s the GitHub repo with the bash script we used to do all of this. Contributions are always welcome!
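For readers who would rather not combine grep, xmlstarlet, and C, the extraction step can be approximated in Python. The sketch below assumes a decompressed chunk contains only whole page elements with the usual title, id, and revision/text children (the first and last blocks of a dump also carry the document header and footer and would need trimming first). The regex, the function name extract_coords, and the sample wikitext are my own illustrations, not the ones from the repo.

```python
import re
import xml.etree.ElementTree as ET

# Matches a {{coord|...}} template that contains no nested braces.
COORD_RE = re.compile(r"\{\{\s*coord\s*\|[^{}]*\}\}", re.IGNORECASE)


def extract_coords(chunk):
    """Return (article_id, title, coord_template) for pages that have one."""
    # A chunk is a run of sibling <page> elements, so wrap it in a root.
    root = ET.fromstring(f"<pages>{chunk}</pages>")
    rows = []
    for page in root.iter("page"):
        wikitext = page.findtext("revision/text") or ""
        match = COORD_RE.search(wikitext)
        if match:
            rows.append(
                (int(page.findtext("id")), page.findtext("title"), match.group(0))
            )
    return rows


# Made-up wikitext; the id, title and coordinates come from the sample output.
SAMPLE_CHUNK = """
<page>
  <title>Swift Valley Nature Reserve</title>
  <ns>0</ns>
  <id>63205443</id>
  <revision>
    <id>1</id>
    <text xml:space="preserve">Intro {{coord|52.39405|-1.25831|display=title}} more text</text>
  </revision>
</page>
<page>
  <title>No coordinates here</title>
  <ns>0</ns>
  <id>2</id>
  <revision><id>3</id><text xml:space="preserve">Just prose.</text></revision>
</page>
"""

if __name__ == "__main__":
    for row in extract_coords(SAMPLE_CHUNK):
        print("|".join(str(field) for field in row))
```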
In the end, we were able to grab over 400,000 coordinates from the data dump and produce a CSV file with article id, title, and coordinates that looks like this:
63205443|Swift Valley Nature Reserve|52.39405 -1.25831
63205477|Deskie Castle|57.35540 -3.33310
63207626|Llewellyn Glacier|59.08333 -134.08333
63207822|Baghdad (North Gate) War Cemetery|33.35510 44.38620
63207930|Hang Ten Icefield|58.87500 -133.71667
63209775|Vistula Spit canal|54.35556 19.31111
63211601|Fairview Cemetery, Niagara Falls|43.10800 -79.08900
63213750|Man Uk Pin|22.52638 114.18437
63213728|Higashi-Ikebukuro runaway car accident|35.72606 139.71889
63213956|Kinkelenburg Castle|51.88960 5.89617
63214322|Shenzhen Bogang F.C.|22.73208 113.81822
63214383|Scottsdale National Golf Club|33.74878 -111.81502
63214411|Kensington House (academy)|51.50209 -0.18652
63216802|Hercules Powder Plant Disaster|40.87200 -74.63700
63217229|Fiestas patronales de Ponce|18.00000 -66.61667
63218528|Heung Yuen Wai|22.52732 114.19866
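If you want to load this output into your own code, a short Python sketch like the one below should be enough. It assumes every row has exactly the shape shown above: article id, a pipe, the title, a pipe, then latitude and longitude separated by a space. The Place type and parse_row function are names I made up for the example.

```python
from typing import NamedTuple


class Place(NamedTuple):
    article_id: int
    title: str
    lat: float
    lon: float


def parse_row(line):
    """Parse one 'id|title|lat lon' row into a Place."""
    # Split the id off the left and the coordinates off the right, so the
    # title in the middle is kept whole (commas in titles are harmless).
    article_id, rest = line.strip().split("|", 1)
    title, coords = rest.rsplit("|", 1)
    lat, lon = coords.split()
    return Place(int(article_id), title, float(lat), float(lon))


if __name__ == "__main__":
    print(parse_row("63211601|Fairview Cemetery, Niagara Falls|43.10800 -79.08900"))
```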
Here is a CSV file you can download containing all of Wikipedia’s coordinate information should you not want to run the script I’ve provided. The file will be updated each month to keep up with new Wikipedia contributions.
So, what’s next? First, I’ll be working to turn these Wikipedia coordinates into interactive locations on the Placemarkt site. Next, I’ll be adding trail and hiking-related information. We’ve also been working on an iOS app that should be coming out soon.