Famous persons on street names worldwide.
All street names are taken from openstreetmap (osm). Wikipedia links are added manually, street prefixes and suffixes are removed, and equivalent streets are merged under a standard name and their frequencies counted. Each person has a summary extracted from wikipedia, together with a list of occupations and a pretty picture. The last step is the generation of an html page for each country and a worldwide summary with the above data.
The core part of the project is the list of street names for each country, the list of persons and their occupations and the json structures affecting the way the data is extracted and parsed from osm. That is, the affixes and equivalents files. All these can be easily modified with any text editor without the need to touch code. All else is composed of point-free ramda pipelines that transform, evolve, extract and compose this data in new ways. A mere 600 lines of code in all.
I've skipped unit tests and typescript and kept the speed of implementation with the idea that what I've started with initially will not be what I'll have implemented at the end. That assumption proved correct.
$ node ulitza.js download <country>
The original osm data is in a compact binary format called pbf (Protocolbuffer Binary Format), an alternative to the xml format.
This step saves the latest pbf file to data/osm/pbf/<country>.osm.pbf. There is no data handling or processing of any kind in this step.
$ node ulitza.js extract <country>
The previously downloaded pbf file contains all kinds of data that a map usually does, like lakes, restaurants and your lover's favorite place to watch the city lights at night. This step figures out which entries are actually street names and saves them locally (not git-committed) to data/osm/raw/<country>.json as city and street name pairs, without any other parsing or filtering.
As an example, here are a few entries from one of the countries better known to me,
["Constanța", "Strada Mihai Viteazu"],
["Constanța", "Strada Mihai Viteazu"],
["Constanța", "Strada Mihai Viteazu"],
["Cluj-Napoca", "Piața Mihai Viteazu"],
["Cluj-Napoca", "Piața Mihai Viteazul"],
["București", "Șoseaua Mihai Bravu"],
["București", "Strada Mihai Viteazul"],
["București", "Strada Mihai Vodă"],
I'll be coming back to this sample list in the next section.
$ node ulitza.js update <country>
The entries from the previous simplified example all refer to the same person, Mihai Viteazul. First, for Constanța, there is just one entry; the other two are duplicates. We get rid of those in the current step. Secondly, "Piața" refers to a city square, while "Șoseaua" and "Strada" both mean "Street". These prefixes also go the way of the waste dump at this point. How? They are specified as regexes in the affixes.js file and each country has a different set of them. With these out of the way, there is still the fact that all these names refer to the same person. In equivalents.js, each country has a list of persons' names that have different spellings but refer back to the same person. This step replaces all such names with a standard name (the first item in the next list) and filters out the duplicates. For our example,
["Mihai Viteazul", "Mihai Viteazu",
"Mihai Vodă",
"Mihai Vodă Viteazul",
"Mihai Bravu"]
After removing duplicates and affixes and reducing all the names to the standard one, we're left with three instances of Mihai Viteazul from the initial list (you do the math).
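The three operations described above (strip affixes, replace equivalents with the standard name, drop duplicate city and name pairs) can be sketched in plain JavaScript as follows. This is only an illustration under assumptions: the function names, the regexes and the shape of the equivalents list are made up for this example and are not the actual contents of affixes.js or equivalents.js, and the real code uses ramda pipelines rather than loops.

```javascript
// Hypothetical affix regexes for Romania; the real ones live in affixes.js.
const affixes = [/^Strada\s+/i, /^Piața\s+/i, /^Șoseaua\s+/i];

// Hypothetical equivalents; the first item of each list is the standard name.
const equivalents = [
  ["Mihai Viteazul", "Mihai Viteazu", "Mihai Vodă", "Mihai Vodă Viteazul", "Mihai Bravu"],
];

// Remove every matching affix from a street name.
const stripAffixes = (street) =>
  affixes.reduce((name, regex) => name.replace(regex, ""), street).trim();

// Lookup table from any known spelling to its standard name.
const standardNames = new Map(
  equivalents.flatMap(([standard, ...others]) =>
    [standard, ...others].map((name) => [name, standard])
  )
);
const toStandard = (name) => standardNames.get(name) ?? name;

// Normalize each pair, then keep every (city, name) pair only once.
const normalize = (pairs) => {
  const seen = new Set();
  const result = [];
  for (const [city, street] of pairs) {
    const name = toStandard(stripAffixes(street));
    const key = JSON.stringify([city, name]);
    if (!seen.has(key)) {
      seen.add(key);
      result.push([city, name]);
    }
  }
  return result;
};

const raw = [
  ["Constanța", "Strada Mihai Viteazu"],
  ["Constanța", "Strada Mihai Viteazu"],
  ["Constanța", "Strada Mihai Viteazu"],
  ["Cluj-Napoca", "Piața Mihai Viteazu"],
  ["Cluj-Napoca", "Piața Mihai Viteazul"],
  ["București", "Șoseaua Mihai Bravu"],
  ["București", "Strada Mihai Viteazul"],
  ["București", "Strada Mihai Vodă"],
];

// One "Mihai Viteazul" per city: Constanța, Cluj-Napoca, București.
console.log(normalize(raw));
```

Deduplicating on the (city, name) pair rather than on the name alone is what keeps one instance per city, which is how the sample list ends up with three.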
The output of the current step is saved in data/persons/countries/ and includes the standard name, the frequency and a wikipedia link. The link is initially empty, but it is filled in when this step has run before and a link is already available. For example,
["Mihai Eminescu", 116, ""],
["Tudor Vladimirescu", 114, ""],
["Unirii", 112, ""],
["Mihai Viteazul", 105, ""],
["Libertății", 99, ""],
["Republicii", 97, ""],
At this point we just have street names, be they of persons or of flowers and trees. The latter we ignore. For the persons we manually add the English wikipedia entry, if it exists, or the native language one otherwise. If there is none, we leave it empty. Take a look at one of the countries in data/persons/country for a concrete example.
I've only considered street frequencies greater than one. This defends primarily against garbage data and also against large output files.
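A minimal sketch of the counting described here, assuming the input is the list of normalized city and name pairs and the output is a list of [name, frequency, link] triples like the ones above. The function name, the descending sort and the way links are carried over from a previous run are assumptions for illustration, not the project's actual code.

```javascript
// Count each standard name, keep names seen more than once, and reuse
// links from a previous run (entries are [name, frequency, link]).
const countNames = (pairs, previous = []) => {
  const links = new Map(previous.map(([name, , link]) => [name, link]));
  const counts = new Map();
  for (const [, name] of pairs) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return [...counts]
    .filter(([, frequency]) => frequency > 1) // drop one-off names
    .sort(([, a], [, b]) => b - a) // most frequent first (assumed order)
    .map(([name, frequency]) => [name, frequency, links.get(name) ?? ""]);
};

// Made-up input, for illustration only.
const pairs = [
  ["Constanța", "Mihai Viteazul"],
  ["Cluj-Napoca", "Mihai Viteazul"],
  ["București", "Mihai Viteazul"],
  ["București", "Unirii"],
  ["Iași", "Unirii"],
  ["Iași", "Lalelelor"],
];
const previous = [
  ["Mihai Viteazul", 2, "https://en.wikipedia.org/wiki/Michael_the_Brave"],
];

console.log(countNames(pairs, previous));
```

With this input, "Lalelelor" is dropped because it occurs only once, and "Mihai Viteazul" keeps the link it had in the previous run.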
This is also a ripe area for continuous updates and improvements. That is, adding new links for streets named after persons, removing links where they are plain wrong (it happens!) or updating them if you know the country better than I do and are confident in your expertise regarding your heroes and poets.
$ node ulitza.js wiki <country>
Extract a summary, image and a list of occupations for each person from data/persons/country/<country>.json based on their wikipedia links and add or update the persons.json with the latest info. This file is a sort of common database for persons from all countries.
Not all entries have a meaningful summary from which to extract occupations: the occupations might not be complete, or the summary might be in the native language. For these cases, I've manually added an occupations-extra key. Ideally, I would go through each person and fill in the occupations manually, but this is a good-enough temporary solution until that happens for all persons.
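As a rough sketch of how the two sources of occupations could be combined when reading a person's record: the occupations-extra key is the one named above, while the occupations key, the function name and the sample record are assumptions made up for this example and may not match the actual layout of persons.json.

```javascript
// Merge extracted and manually added occupations, without duplicates.
// "occupations-extra" is the key named above; "occupations" is an assumed key.
const allOccupations = (person) => [
  ...new Set([
    ...(person.occupations ?? []),
    ...(person["occupations-extra"] ?? []),
  ]),
];

// Made-up record, for illustration only.
const person = {
  occupations: ["poet"],
  "occupations-extra": ["poet", "journalist"],
};

console.log(allOccupations(person)); // [ 'poet', 'journalist' ]
```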
$ node ulitza.js html <country>
$ node ulitza.js html-worldwide
$ node ulitza.js html-all-countries
Last step, generate html pages for each country and a worldwide summary and save them in data/html/countries/<country>.html. You can also generate all countries at once, useful when some change affects everything, like the template.html file. This one takes some time to finish.
Slips happen. While developing this project, these utilities have come in handy.
$ node ulitza.js inspect <country> <regex>
Probably the least used command, but extracting only the elements that contain street names requires knowing which tags hold them, and this command shows the raw entries matching a regex.
For example, the command,
$ node ulitza.js inspect romania "cuza vodă"
and the output,
{
  "type": "node",
  "tags": {
    "addr:city": "Cluj-Napoca",
    "addr:housenumber": "16",
    "addr:postcode": "400107",
    "addr:street": "Strada Cuza Vodă",
    "amenity": "restaurant",
    "delivery": "yes",
    "diet:vegan": "yes",
    "name": "Bistro Lovegan",
    "opening_hours": "Mo-Fr 12:00-17:00",
    "phone": "+40723698654",
    "website": "https://www.meniudigital.ro/bistrolovegan"
  }
}
This happens to be an entry for a cheap bodega, but the interesting parts are its type ("node") and its tags. From the tags it is clear that this entry is in a city and on a street.
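To make the role of the tags concrete, here is a minimal sketch of turning such an entry into a city and street pair. It assumes the addr:city and addr:street tags seen in the output above are the ones to read; the function name is hypothetical and the real extract step may look at other tags or element types.

```javascript
// Return a [city, street] pair for an osm element,
// or null when either of the two tags is missing.
const toCityStreet = (element) => {
  const tags = element.tags ?? {};
  const city = tags["addr:city"];
  const street = tags["addr:street"];
  return city && street ? [city, street] : null;
};

// Trimmed-down version of the entry shown above.
const element = {
  type: "node",
  tags: {
    "addr:city": "Cluj-Napoca",
    "addr:street": "Strada Cuza Vodă",
    amenity: "restaurant",
    name: "Bistro Lovegan",
  },
};

console.log(toCityStreet(element)); // [ 'Cluj-Napoca', 'Strada Cuza Vodă' ]
```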
For countries that are less well represented, like China, India, Japan, etc., there might be different tags. Search for your favorite poet and let me know if you discover anything new.
$ node ulitza.js check <country>
With so many persons, as a safety net against errors, the check command finds equivalent names that are specified more than once for the given country. So, if we had both these entries for the same country (Romania, in this case),
["Cuza Vodă", "Alexandru Ioan Cuza",
"Cuza-Vodă",
"A. I. Cuza"]
...
["Alexandru Ioan Cuza", "Cuza Vodă"]
the above command will output,
$ node ulitza.js check romania
Duplicate Equivalents:
Cuza Vodă
Alexandru Ioan Cuza
Just make sure all equivalents are under a single entry and that there are no duplicate entries. This command helps in that regard.
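The idea behind this check can be sketched as follows: collect every name that shows up in more than one equivalents entry. The function name and the exact rule are assumptions for illustration; the actual check may be implemented differently.

```javascript
// Return every name that appears in more than one equivalents entry.
const duplicateEquivalents = (equivalents) => {
  const entryCount = new Map();
  for (const entry of equivalents) {
    // A Set so that a name repeated inside one entry counts once.
    for (const name of new Set(entry)) {
      entryCount.set(name, (entryCount.get(name) ?? 0) + 1);
    }
  }
  return [...entryCount]
    .filter(([, count]) => count > 1)
    .map(([name]) => name);
};

const equivalents = [
  ["Cuza Vodă", "Alexandru Ioan Cuza", "Cuza-Vodă", "A. I. Cuza"],
  ["Alexandru Ioan Cuza", "Cuza Vodă"],
];

console.log(duplicateEquivalents(equivalents));
// [ 'Cuza Vodă', 'Alexandru Ioan Cuza' ]
```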
$ node ulitza.js check <country>
If data/countries/<country>.json contains duplicate links, invalid entries in the link field, or links that do not point to wikipedia, those should be fixed. For example, assuming these entries somewhere in Romania's json file,
["Cuza Vodă", 95, "https://en.wikipedia.org/wiki/Alexandru_Ioan_Cuza"],
["Nicolae Balcescu", 15, "https://en.wikipedia.org/wiki/Alexandru_Ioan_Cuza"],
["Titu Maiorescu", 15, "https://news.ycombinator.com/"],
the above command will output,
$ node ulitza.js check romania
Duplicate Links:
https://en.wikipedia.org/wiki/Alexandru_Ioan_Cuza,2
Inconsistent Links:
https://news.ycombinator.com/
This is helpful, as "Nicolae Balcescu" is a different person and should get a different link. Also, the hacker news website is useful, but not for extracting wiki summaries or for linking a person to their wikipedia page (the next steps), so that should be remedied as well.
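Both link checks can be sketched in a few lines, assuming entries are [name, frequency, link] triples as above. The function name and the regex deciding what counts as a wikipedia link are assumptions for this example; the real check may use a different rule.

```javascript
// Entries are [name, frequency, link] triples; empty links are ignored.
const checkLinks = (entries) => {
  const counts = new Map();
  for (const [, , link] of entries) {
    if (link) counts.set(link, (counts.get(link) ?? 0) + 1);
  }
  // Links used by more than one entry, as [link, count] pairs.
  const duplicates = [...counts].filter(([, count]) => count > 1);
  // Assumption: a valid link is an https URL on a wikipedia.org subdomain.
  const isWikipedia = (link) =>
    /^https:\/\/[a-z-]+\.wikipedia\.org\/wiki\/.+/.test(link);
  const inconsistent = [...counts.keys()].filter((link) => !isWikipedia(link));
  return { duplicates, inconsistent };
};

const entries = [
  ["Cuza Vodă", 95, "https://en.wikipedia.org/wiki/Alexandru_Ioan_Cuza"],
  ["Nicolae Balcescu", 15, "https://en.wikipedia.org/wiki/Alexandru_Ioan_Cuza"],
  ["Titu Maiorescu", 15, "https://news.ycombinator.com/"],
];

const { duplicates, inconsistent } = checkLinks(entries);
console.log("Duplicate Links:", duplicates);
console.log("Inconsistent Links:", inconsistent);
```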
$ node ulitza.js check-all
Check all countries for the inconsistencies mentioned above and output their names. Use the check command for the respective country to dig into the details. As an example,
$ node ulitza.js check-all
Countries with duplicate equivalents:
spain
Countries with duplicate links:
gcc-states
iran
iraq
morocco
romania
tunisia
Countries with inconsistent links:
romania