Data Mining Emigration Rate

July 13, 2012

The Internet has enabled quite a few things. Foremost of these is a new paradigm for commerce, but buried among all the digital debris is a new investigational method called data mining. In data mining, large volumes of data are analyzed to discover unexpected correlations, or to quantify a suspected correlation.

Data mining has become an important specialty, and I've mentioned it in quite a few previous articles:

Data, Data, Everywhere (February 7, 2007)  »   The Exchangeable Image File Format (EXIF) data of digital photographs on the Flickr photo-sharing web site were used to rank the most popular cameras. Similar data were used to determine whether people preferred sunrise or sunset (sunset is slightly more popular than sunrise). It was also possible to extract from these data reasonable curves of the times for sunrise and sunset for the narrow band of latitudes in which the majority of Earth's population resides.

Basic Research (October 22, 2010)  »   I repeated a data-mining exercise, first done by Roger Pielke, Jr., of the Center for Science and Technology Policy Research of the University of Colorado at Boulder. I looked for the occurrence of the phrase, "basic research," in The New York Times as a function of time. Interest in basic research peaked quite suddenly after it was realized that technology derived from what had been very basic research had won World War II. There was an additional surge after Sputnik was launched. Today, basic research seems to be in retreat.

Culturomics (January 13, 2011)  »   The Ngram Viewer from Google is a way to examine the frequency of occurrence of words as a function of time for the many books and other publications that Google has indexed. The Ngram concept has spawned a web site, www.culturomics.org, with analyses of cultural trends. Such analysis found that 8,500 new words enter the English language annually, but many of these aren't found in dictionaries. It was also discovered that the rate of mention of individual inventions was about twice as fast at the end of the nineteenth century as at its start.

Hedonometrics (January 31, 2011)  »   Data mining of Twitter accounts answered a question that I've had since childhood; namely, when do most families have dinner? Six PM is the time most people have dinner, although any time from 5:00 PM to 7:00 PM is nearly as likely. A research team has tracked words expressing happiness from 50 million Twitter accounts to create a "Hedonometer" that shows the instantaneous happiness index of Twitter users and, by inference, the world at large.

Numb3rs (June 6, 2012)  »   Remote sensing of the Earth via satellite has revealed an interesting correlation between the density of trees and the income level of urban neighborhoods. Each percent increase in per capita income correlates with an increased tree cover of 1.76 percent.

People have been especially mobile in the past few centuries. The United States was populated by people from other countries, and there is still considerable migration of people between countries today.

Countries are more likely to keep track of those who enter than those who leave. Data on which people enter a country, as shown in the following figure, are often easy to find, but data on how many people leave specific countries are often not tabulated.

Distribution of emigration of Poles (More red = greater numbers). There are many people of Polish origin in the United States. My maternal grandparents were both Polish immigrants to the US. (Via Wikimedia Commons)

To tackle the problem of accurately determining emigration rates, scientists from the Max Planck Institute for Demographic Research (MPIDR, Rostock, Germany) and Yahoo! Research used a data mining technique involving email.[1-3] The technique made it possible to assemble emigration statistics for nearly every country of the world. These mined data included the emigrant's gender and age, something that's rarely possible with official statistics for emigration.

Emilio Zagheni of MPIDR, one of the authors of the study presented at a meeting of the Association for Computing Machinery (ACM),[3] summarized the problems with official records: the data are outdated and inconsistent; official records are difficult to use; emigrants tend not to leave a trail; and there is also no clear definition of who should be called a migrant.[1-2]

The data mining idea is simple: you are from where you email. Zagheni and co-author Ingmar Weber of Yahoo! Research used the IP addresses of email messages sent from 43 million anonymous Yahoo! accounts between September 2009 and June 2011 for geolocation.[1-2] These accounts contained the self-reported birthdate and gender of the sender.

When an account-holder started to send emails exclusively from a different location, it was presumed that the person had moved. The subject and content of a message were not accessed, and the account-holders were kept anonymous, a feature of this study that pleases me and other Internet privacy advocates.[4]
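The "moved when the emails move" rule can be sketched in a few lines of Python. This is only a minimal illustration and not the authors' algorithm: the function name infer_move, the (date, country) input format, and the min_run threshold are all my own assumptions, and it presumes that each message's IP address has already been resolved to a country.

```python
from datetime import date

def infer_move(events, min_run=3):
    """Guess whether one account-holder moved between countries.

    events  : list of (date, country) pairs, one per email sent
              (hypothetical format; geolocation is assumed done already).
    min_run : assumed minimum number of messages from the new country
              before we call it a move rather than a trip.
    Returns (origin, destination, date_of_first_message_after_move) or None.
    """
    ordered = sorted(events)                  # chronological order
    if not ordered:
        return None
    countries = [country for _, country in ordered]
    origin, destination = countries[0], countries[-1]
    if origin == destination:
        return None                           # ended up where they started

    # Walk back from the end to find where the final run of
    # destination-only messages begins.
    start = len(countries)
    while start > 0 and countries[start - 1] == destination:
        start -= 1

    if len(countries) - start < min_run:
        return None                           # too few messages to be sure
    if any(country != origin for country in countries[:start]):
        return None                           # mixed history, not a clean switch
    return origin, destination, ordered[start][0]

sample = [
    (date(2010, 1, 5), "US"),
    (date(2010, 2, 1), "US"),
    (date(2010, 6, 1), "MX"),
    (date(2010, 7, 1), "MX"),
    (date(2010, 8, 1), "MX"),
]
print(infer_move(sample))
```

Requiring a clean switch from one country to another is stricter than anything stated in the study summary; a real analysis would need some tolerance for short visits, such as the cross-border trips the authors saw for Mexico and the US.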

The study produced the first data of US emigration by age and gender, as shown in the figure. Said Zagheni, "In the U.S., many statistics are collected about people who move into the country, but there is no system that keeps track of people who move out."[1]

US emigration rate, 2009-2011, as determined by the data mining technique of ref. 3. (Graph rendered by the author using Gnumeric)

The study also addresses the interesting example of mobility across the US-Mexican border. Emigrants from Mexico to the US generally spent time in the US before their move, or visited Mexico shortly after their move to the US. People in their thirties were more likely to emigrate from Mexico to the US than people in their fifties or older.[3]

There was, of course, considerable processing of the data to remove spammers and similar noise, and to adjust for the fact that older people don't email as often as younger people.[1-2] The algorithms for this were well considered, so these data appear reliable.
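To give a feel for what an age adjustment can look like, here is a generic reweighting sketch in Python. It is not the correction used in the study, whose details aren't given here; it simply scales the movers seen in each age group by how under-represented that group is among email users. The function name and all of the numbers are made up for illustration.

```python
def reweighted_emigrants(movers, users, population):
    """Scale observed movers up to the whole population, group by group.

    movers     : {age_group: account-holders inferred to have moved}
    users      : {age_group: email account-holders observed}
    population : {age_group: people in the general population}
    All three inputs are hypothetical; none come from the study.
    """
    estimates = {}
    for group, n_users in users.items():
        if n_users == 0:
            continue                           # nothing observed, no estimate
        weight = population[group] / n_users   # people represented per user
        estimates[group] = movers.get(group, 0) * weight
    return estimates

# Invented numbers: older people are scarce among email users,
# so each older mover stands in for many more people.
movers = {"20-29": 50, "60-69": 5}
users = {"20-29": 10_000, "60-69": 1_000}
population = {"20-29": 40_000, "60-69": 30_000}
print(reweighted_emigrants(movers, users, population))
```

In this toy case, ten times fewer older movers are observed, but the estimates for the two groups end up much closer once the scarcity of older email users is taken into account.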


