On the Geowankers list, Andrea Moe posted this query:
> I have an existing collection of lat/lons, each representing a place where a
> photo was taken. I want to computationally find the geographic clusters in
> this collection, i.e. the geographic areas with the densest concentrations
> of points. (So it sounds like Andrew’s “location-closeness clustering” is
> what I’m thinking of.) Having found these most-photographed areas, I want to
> find the geographic name that best describes each area, such as a region,
> city, neighborhood or park name. So, I’m looking for two different things, a
> location-closeness clustering algorithm and a gazetteer lookup.
What he wants closely matches what I want, and I have written a bit of code around it, so my response on the list seems (according to Schuyler) to be a fit topic for posting:
I have the precise same problem of having photographs and trying to
extract meaning from the clusters. I’ve been working on code to
scratch this itch, and I’d be happy to send it to anyone, or to work
with someone else to generalize the solution. The code is in Perl.
I also have track logs for all of the points where the photos were
taken. So I know when I was near to each photo (both when it was
initially taken, plus subsequent visits to the area). I also have a
collection of waypoints for the general area. And finally, I have a
collection of travel ephemera like ticket stubs and receipts, that all
have time stamps and which I’ve been geocoding based on the track
logs.
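As a rough illustration of geocoding a time-stamped item against a
track log (a Python sketch, not the Perl code described here), one
option is to interpolate linearly between the two track points that
bracket the time stamp. The function name `locate_at` and the
(timestamp, lat, lon) tuple layout are assumptions made for this
example, and it ignores tracks that cross the antimeridian.

```python
from bisect import bisect_left


def locate_at(track, when):
    """Estimate (lat, lon) at time `when` from a time-sorted track log.

    `track` is a list of (timestamp_seconds, lat, lon) tuples sorted by
    timestamp (a hypothetical layout chosen for this sketch).
    Returns None when `when` falls outside the logged time range.
    """
    if not track:
        return None
    times = [t for t, _, _ in track]
    if when < times[0] or when > times[-1]:
        return None
    i = bisect_left(times, when)
    if times[i] == when:
        # Exact hit on a logged point.
        return track[i][1], track[i][2]
    # Otherwise interpolate between the bracketing points.
    (t0, lat0, lon0), (t1, lat1, lon1) = track[i - 1], track[i]
    f = (when - t0) / (t1 - t0)
    return lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)
```

A receipt stamped halfway between two track points lands halfway
between them. If the gap between the bracketing points is long, the
estimate is probably not worth trusting.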
I wrote some perl code to show the waypoints that are closest to each
photo, and then to show the photos that are close to each waypoint.
And then to show when I was ‘near’ to each waypoint and to each
photograph.
I realized during this process that in many (but not all!) cases the
nearest waypoint(s) to a picture made a pretty good tag for that
picture.
The pictures where ‘Vodensky’ was the nearest waypoint were, sure
enough, best tagged as ‘Vodensky’ for the Vodensky Military Museum.
And the pictures closest to the waypoint ‘ASIEN-GIRLS’ were in fact of
the strip club that featured ‘Asien Girls.’
This technique serves to create clusters of a sort. But in this case
the waypoints manually define the cluster centers, which is very
effective, but is sort of cheating.
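The nearest-waypoint tagging idea can be sketched as follows. This is
a Python illustration, not the Perl code mentioned above. The
(name, lat, lon) tuple layout, the `tag_photos` name and the `max_km`
cutoff are all assumptions for the example. The cutoff is one guess at
handling the "but not all" cases, where the nearest waypoint is too
far away to be a meaningful tag.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def tag_photos(photos, waypoints, max_km=0.5):
    """Map each photo name to its nearest waypoint name.

    `photos` and `waypoints` are lists of (name, lat, lon) tuples
    (a hypothetical layout). A photo with no waypoint within `max_km`
    is mapped to None instead of being given a misleading tag.
    """
    tags = {}
    for pname, plat, plon in photos:
        best_name, best_km = None, max_km
        for wname, wlat, wlon in waypoints:
            d = haversine_km(plat, plon, wlat, wlon)
            if d <= best_km:
                best_name, best_km = wname, d
        tags[pname] = best_name
    return tags
```

Grouping the photos by their tag then gives the waypoint-centered
'clusters' described above. This brute-force loop is fine for a few
thousand points; a spatial index would be the next step for more.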
Schuyler wrote some cool clustering code for Google Maps Hacks. Here
is an example of the code in action:
http://mappinghacks.com/projects/gmaps/cluster.html
You click on markers with the black dots to zoom in on a cluster. You
click on a marker without a black dot to see information on an
individual point.
This shows most of my personal waypoints in a clever sort of
clustering. He started with a K-means clustering
(http://en.wikipedia.org/wiki/K_means) and then futzed with it a bit
in order to make it look better.
This is not the only, or maybe even the best, clustering algorithm,
but analyzing the strengths of various clustering algorithms is beyond
my current abilities and interests.
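For reference, plain K-means looks roughly like the Python sketch
below. This is the textbook (Lloyd's) version, not Schuyler's adjusted
code. It treats lat/lon as flat coordinates, which is only a
reasonable assumption over small areas away from the poles, and the
function name, defaults and random seeding are choices made for the
example.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's K-means on a list of (lat, lon) tuples.

    Returns (centroids, assignment), where assignment[i] is the index
    of the centroid that points[i] belongs to.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    assignment = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, assignment
```

The result depends on the starting centroids and on picking k up
front, which is presumably part of why some futzing is needed to make
the output look right on a map.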
Here is the same code used to show a cluster of my recent pictures:
http://mappinghacks.com/projects/gmaps/cluster_pix.html
In theory you can click on the markers without black dots and an info
box pops up with a thumbnail, and you can click on the thumbnail to
see all the pictures. This works, except that a) the thumbnails are
not preloaded, so it can take a bit to load, and more importantly b)
there are many ‘terminal’ clusters, by which I mean clusters that
can’t be zoomed in further because the map is at maximum zoom. I
should do something about that, like show a list of points when they
can’t be zoomed further, or some such. Some day.
The gazetteer lookup is either fairly easy, or a PITA. The challenge
is in having a meaningful gazetteer. The US Geographic Names
Information System (GNIS) data is great:
http://geonames.usgs.gov/ (there are also links there to the GEOnet
Names Server for international names).
The problem with the GNIS is that they have nearly two million names
for the continental US. An embarrassment of riches. You could get
pretty good data by using the populated place ‘class,’ but even there
you end up with lots of supposed populated place names which don’t
always match the names that people actually use (this ‘problem’ might
just be my personal lack of sensitivity/awareness to the historical
aspects of my location…). Using the ‘park’ class works well.
Getting regions and neighborhoods is a different challenge! What is
a region or a neighborhood? To some extent a ‘region’ could have an
objective location, but mostly regions and neighborhoods are social
constructs. As such they have fuzzy boundaries and they vary
(sometimes greatly!) with time.
This Wikipedia definition of SoMa in San Francisco is an example of an
erroneous attempt at definition:
http://en.wikipedia.org/wiki/South_of_Market%2C_San_Francisco%2C_California
“The eastern edge along the Embarcadero and south-eastern corner of
this area (where Mission Creek meets the bay) is known as South Beach,
a separate neighborhood, and the border below Townsend Street begins
Mission Bay. The north-eastern corner (where Market Street meets the
bay) is often considered part of the Financial District.”
What ‘border’ below Townsend? And ‘often considered’? This is an
_attempt_ to quantify a social construction…
The Neighborhood Project was/is an attempt at exploring the boundaries
of neighborhood based on what people believe. http://hood.theory.org/
They used the ‘Blobby’ algorithm…and they have materials that talk
about that on their site. It is possible that you could use
metaballs/blobby objects rather than true clustering for your photos.
I love this description (from:
http://www.siggraph.org/education/materials/HyperGraph/modeling/metaballs/metaballs_mward.html):
“We can think of a metaball as a particle surrounded by a density
field, where the density attributed to the particle (its influence)
decreases with distance from the particle location. A surface is
implied by taking an isosurface through this density field - the
higher the isosurface value, the nearer it will be to the particle.
The powerful aspect of metaballs is the way they can be combined.”
If you really want the ‘clusters’ but you are willing to ignore the
outliers, you could ‘metaball’ your photos, and then assume that the
clusters are where you have contiguous areas. Or you could use the
centers of those ‘clusters’ as the initial centroids for a K-means
algorithm (that _seems_ like actually a pretty good idea).
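A rough Python sketch of that idea: give each photo a density field
that falls off with distance, sum the fields on a grid, keep the cells
above a threshold, and treat each contiguous patch of cells as one
'cluster'. The falloff function, the cell size, the radius, the
threshold and the function names are all assumptions picked for the
example, and lat/lon are treated as flat coordinates.

```python
from collections import defaultdict, deque
from math import ceil, floor


def density_field(points, cell, radius):
    """Sum a smooth falloff from every point onto a grid of cells."""
    field = defaultdict(float)
    reach = int(ceil(radius / cell))
    for lat, lon in points:
        ci, cj = floor(lat / cell), floor(lon / cell)
        for i in range(ci - reach, ci + reach + 1):
            for j in range(cj - reach, cj + reach + 1):
                # Squared distance from the point to the cell center,
                # as a fraction of the influence radius.
                dlat = (i + 0.5) * cell - lat
                dlon = (j + 0.5) * cell - lon
                r2 = (dlat * dlat + dlon * dlon) / (radius * radius)
                if r2 < 1.0:
                    field[(i, j)] += (1.0 - r2) ** 2
    return field


def blob_centers(points, cell=0.01, radius=0.03, threshold=1.5):
    """Return the density-weighted center of each contiguous blob."""
    field = density_field(points, cell, radius)
    inside = {c for c, v in field.items() if v >= threshold}
    centers = []
    while inside:
        # Flood-fill one blob of 4-connected cells above the threshold.
        start = inside.pop()
        queue, blob = deque([start]), [start]
        while queue:
            i, j = queue.popleft()
            for n in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if n in inside:
                    inside.remove(n)
                    queue.append(n)
                    blob.append(n)
        total = sum(field[c] for c in blob)
        lat = sum((i + 0.5) * cell * field[(i, j)] for i, j in blob) / total
        lon = sum((j + 0.5) * cell * field[(i, j)] for i, j in blob) / total
        centers.append((lat, lon))
    return centers
```

Because a single point contributes at most 1.0 to any cell, a
threshold above 1.0 drops isolated outliers automatically. The
returned centers could then be used as the starting centroids for a
K-means run.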
I’ve been working on some code to implement a geographical data store
that would sort of intrinsically allow for the creation of user
defined ‘areas’ or ‘regions’ or ‘neighborhoods.’
Anyway…I am very interested in these thoughts. I’d love to
collaborate with you, or anyone else, on these areas. I’m currently
working in Perl and Ruby on Rails using MySQL and PostGIS.
Posted in geodata, collaborative mapping, data, qpsycho, software