Infochimps.org
Philip (flip) Kromer
≠ ?? II SQL Query Queer Eye Sequel
"Show me, by year and hospital, ratio of iatrogenic (doctor-caused) to overall incident rate, stratifying by median residency-completion year, hospital budget, local per-capita income, residency 'match' rank."
Impact of 80-hour rule on Surgical Training?
What Kinds of Data?
●By Place: Local Housing Prices, School Test Scores, Crime Statistics, Political Contributions, Demographics, Days of Sunshine, ...
●By Time: Lunar Phase, Kings of England, Population, Currency @PPP, Sorghum Exports, Length of Day, Cosmic Bkgd Rad Temp
●Patent filings, Court cases, SEC financial documents
●Spoken, Written frequency for every word in English; Scrabble words; collocations, synsets, lemmatizations, translations
●Time Zones, SGML Entities, Calendrical Tables
●Census, BLS, USDA, DOT, ...
●Genome/Protein/Cell
●Chemical Compounds, MSDS
●News Articles: time, keywords, referenced places, persons, &c
●Blog Corpora, Academic Papers, Mailing List/Usenet metadata
●Twitter, Continental Air, Flickr, Pirate Bay bittorrents, Linux Kernel commits
●Sports events from pitch trajectory/ball location to game, year, season, franchise
●Books, Music, Film, Art by creator, medium, work, &c
No Jet Packs until you do your chimpwork
Current reality working with rich datasets: the stupidest, easiest, most readily shared tasks are the ones that take the longest
Finding Data Sucks
Sharing Data Sucks
Getting Data Sucks
Using Data Sucks
Finding Data Sucks
Ex:
1. Dirty Words
○Noisy
2. Mileage Chart
○Exclusion
3. Historical Currency Exchange Rate
○Masking
○Not in top 100 results for: exchange rate dataset, historical exchange rate, exchange rate download, exchange rate dollar foreign
Finding Data => Not Suck
●Domain-Specific ("Rich Open Datasets") Search
●Tagging
●Metadata-aware Search
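The tagging and metadata-aware search bullets above could look roughly like the sketch below. Everything in it is hypothetical: the `Dataset` record, the `search` helper, and the three catalog entries are made up for illustration and do not describe how Infochimps actually implements search.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical catalog entry: a title, free-form tags, structured metadata."""
    title: str
    tags: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)


def search(catalog, tags=(), **metadata):
    """Return titles of datasets that carry every requested tag and whose
    metadata matches every requested key/value pair exactly."""
    wanted = set(tags)
    return [
        d.title
        for d in catalog
        if wanted <= d.tags
        and all(d.metadata.get(k) == v for k, v in metadata.items())
    ]


# Illustrative entries only; none of these are real catalog records.
catalog = [
    Dataset("Historical Currency Exchange Rates",
            {"currency", "time-series"},
            {"format": "csv", "license": "public-domain"}),
    Dataset("Exchange Rate Widget for Your Blog",
            {"currency"},
            {"format": "html"}),
    Dataset("US City Mileage Chart",
            {"geography", "distance"},
            {"format": "csv"}),
]

# Tag match alone is noisy; adding a metadata constraint filters out the widget.
print(search(catalog, tags=["currency"], format="csv"))
```

Filtering on structured fields such as format or license is what separates this from keyword search, where the dataset gets masked by pages that merely mention the same words.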
Sharing Data Sucks
●Share data: either
○no one finds it
○everyone finds it
Sharing Data => Not Suck
●Relieve the hosting burden
●Fast, free, infinite
●As centralized as you want.
Getting Data Sucks
●Larcenous Prices
○Gillette, WY, pop 19,646
○KGCC: ~20 arrivals/day
○http://flightaware.com/live/airport/KGCC/history/buy
○2yrs, full data: $1365
○2yrs, no owners: $1050
●Restricted Access
○NCDC: 100 yrs Global Hourly Weather Data
○Free to redistribute, public domain
○Free for .edu, .gov, .mil
○$2000 for everyone else
●Redistribution Restrictions
●Legally Dubious Licensing
○3.14159 26535 89793 23846 26433 83279 50288 41971
(c) 2008 Flip Kromer All Domestic & Foreign Rights Reserved, in perpetuity. Use of this data without a pre-authorized license is prohibited. No yuo.
○RIPE DNS Regional Registry - WHOIS database: "Public database that contains information about registered IP address space and AS numbers, routing policies, and reverse DNS delegations in the RIPE region."
●Uncertain Licensing
○? ? ?
●Uncertain Legality
○???
○Whitburn Project: "For the last ten years, obsessive record collectors in Usenet have been working on the Whitburn Project ... created a spreadsheet of 37,000 songs and 112 columns of raw data, including each song's duration, beats-per-minute, songwriters, label, and week-by-week chart position. It's 25 megs of OCD, and it's awesome."
Getting Data => Not Suck
●You get the whole thing. Have Fun!
●No additional restrictions on use or redistribution
●Clear summary of pre-existing rights & restrictions
●Always Completely Free and Open
Using Data Sucks
●Foolish Formats
○XML - YAML - JSON
○CSV - TSV - XLS
○OOXML
○Flat (columnar) Text
○STATA - SAS
○Custom Text
○Custom Text with Poorly Thought-Out Quoting
Data Formats => Not Suck
Infinite Monkeywrench
●Lightweight, modular, agnostic toolkit for munging datasets and managing workflow
●Data => ActiveRecord/sqlite as ORM => YAML/XML/JSON/CSV/*
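As a rough illustration of the "data in, any format out" idea, here is a minimal sketch using only the Python standard library. It is not Infinite Monkeywrench and does not use its API; `load_csv`, `dump_json`, and `dump_tsv` are hypothetical helper names, and the single sample row reuses the Gillette, WY figure from the earlier slide.

```python
import csv
import io
import json


def load_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def dump_json(records):
    """Serialize the neutral record list as JSON."""
    return json.dumps(records, indent=2)


def dump_tsv(records):
    """Serialize the same records as tab-separated text.
    The csv module takes care of quoting, so no hand-rolled escaping."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


raw = 'city,state,pop\n"Gillette",WY,19646\n'
records = load_csv(raw)      # one neutral in-memory form...
print(dump_json(records))    # ...many output formats
print(dump_tsv(records))
```

The design point is the neutral middle layer: each format needs one reader and one writer, not a converter for every pair of formats.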
Data Formats => Not Suck
●Wikipedia Principle: "The stuff that's interesting or useful is the stuff that people take time to improve"
Using Data Sucks
●Stupid Structure
○Statistical Abstract: 1300+ tables; each has
■inline footnotes
■Compound column heads:
| Population |
| 1700-1800 | 1800-1900 |
■Unclear Start-of-table / End-of-table
●Missing Metadata
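One way to tame the compound column heads mentioned above is to collapse the stacked header rows into a single name per column. The sketch below is an assumption about how such a table might arrive (a spanning head followed by blank cells); `flatten_headers` is a hypothetical helper, not part of any Infochimps tool.

```python
def flatten_headers(header_rows, sep=" "):
    """Collapse stacked header rows into one name per column.

    Assumption: a blank cell in any header row continues the nearest
    non-blank cell to its left, which is how spanning heads usually
    look after export to plain text.
    """
    filled = []
    for row in header_rows:
        last = ""
        out = []
        for cell in row:
            cell = cell.strip()
            if cell:
                last = cell
            out.append(last)   # forward-fill blanks with the spanning head
        filled.append(out)
    # Join the pieces for each column from top row to bottom row.
    return [sep.join(part for part in col if part) for col in zip(*filled)]


rows = [
    ["Population", ""],            # spanning head covers both columns
    ["1700-1800", "1800-1900"],    # per-column sub-heads
]
print(flatten_headers(rows))
```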
Metadata => Not Suck
Collaborative Curation
●Infochimps Stupid Schema
○Everybody else's schema is wrong*, ours is only stupid
○*except Freebase
●Git-style versioning
●!!Write access bot-able!!
○you.infochimp.org
○other tweaks
Using Data Sucks
●Uncertain Provenance
Provenance => Not Suck
●Trust Metrics for users
●Value Metrics for products
●End-to-end:
○Pointer to source !!required!!
○munge scripts source
○SHA1 Fingerprint
○Digitally Signed Fingerprint
■(... trust metric)
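The end-to-end chain above (required source pointer, munge script, SHA1 fingerprint) might be recorded along these lines. This is a sketch under stated assumptions: the record layout, the function names, and the example.org URL are all hypothetical, and the digital-signature step is left out because it depends on a key infrastructure the slides do not specify.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """SHA1 hex digest of the raw bytes."""
    return hashlib.sha1(data).hexdigest()


def provenance_record(data: bytes, source_url: str, munge_script: bytes) -> dict:
    """Build a hypothetical provenance record for a dataset.
    The source pointer is mandatory, mirroring '!!required!!' on the slide."""
    if not source_url:
        raise ValueError("a pointer to the source is required")
    return {
        "source": source_url,
        "munge_script_sha1": fingerprint(munge_script),
        "data_sha1": fingerprint(data),
    }


def verify(data: bytes, record: dict) -> bool:
    """True if the bytes in hand still match the recorded fingerprint."""
    return fingerprint(data) == record["data_sha1"]


data = b"city\tstate\tpop\nGillette\tWY\t19646\n"
script = b"# placeholder for the munge script source\n"
# The URL below is a placeholder, not a real dataset location.
record = provenance_record(data, "http://example.org/raw/cities.csv", script)

print(verify(data, record))            # unchanged copy
print(verify(data + b"oops", record))  # tampered or corrupted copy
```

Anyone holding a copy can recompute the fingerprint, so integrity checks need no trust in the host. Deciding whose fingerprints to believe is the job of the signature and the trust metric.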
Make Freebase &c Easier
●Nimble Traversal of 'properties'
○A country is a boundary, a spatial region, a location, a flag, a country code, ...
○Pick datasets up by thread: "enforce comparisons"
●Separate content from representation:
○"Length" (concept) <= furlongs, meters, point-point, light-seconds, ...
○% Population Growth = (People/People)%
○Frink / BSD Units / Freebase / SUMO Ontologies
○Things can have more than one concept
○"Do What Makes Sense => We'll Figure it Out Once There's A Million Examples"
You Got Your Troubles I Got Mine
●Collaboration:
○Integrity / Vandalism / NPOV / Epistemology
●Federation / Distributiveness
●Sustainability
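To make "separate content from representation" concrete, here is a minimal sketch in which a length is stored once as a concept and rendered in whichever unit the reader wants. The `Length` class and `percent_growth` helper are hypothetical; real unit systems such as Frink or BSD Units cover far more. The conversion factors are the standard exact definitions (1 furlong = 201.168 m, 1 light-second = 299,792,458 m).

```python
from dataclasses import dataclass

# Representations of the "length" concept, as factors to the base unit (meters).
TO_METERS = {
    "meter": 1.0,
    "furlong": 201.168,
    "light-second": 299_792_458.0,
}


@dataclass(frozen=True)
class Length:
    """The concept: one stored value, independent of how it is displayed."""
    meters: float

    @classmethod
    def of(cls, value, unit):
        return cls(value * TO_METERS[unit])

    def to(self, unit):
        return self.meters / TO_METERS[unit]


def percent_growth(before, after):
    """People/People cancels out, so the result is a bare percentage."""
    return 100.0 * (after - before) / before


track = Length.of(8, "furlong")
print(track.to("meter"))
print(percent_growth(19_000, 19_646))
```

Because the units live in the representation layer, two datasets that report length in furlongs and in meters can still be compared directly.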