Intro to Data Science : Lesson 2
Lesson 2: Data Wrangling
- More than 30 videos, each about a minute or less
- wrangle: to argue or dispute about something (dictionary sense)
- Files, Database, API
- Messy data
- Acquiring data
- government site
- Baseball data - Sean Lahman | Database Journalist
- csv, xml, json, MS Access, sql etc.
- import pandas
- baseball_data = pandas.read_csv('xx.csv')
- print(baseball_data['yyy'])
- Dash Docset
- Write to csv
- DataFrame.replace
- Transforming column names
- aadhaar_data.rename(columns = lambda x: x.replace(' ', '_').lower(), inplace=True)
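A minimal sketch tying together the read, replace, rename, and write steps noted above. The CSV text, the column names other than those in the rename note, and the 'unknown' placeholder value are all made up for illustration; a real run would read an actual file such as 'xx.csv'.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
raw = io.StringIO(
    "Registrar,Enrolment Agency,Aadhaar Generated\n"
    "Bank A,Agency X,10\n"
    "Bank B,unknown,7\n"
)

data = pd.read_csv(raw)

# Same renaming rule as in the note: spaces become underscores, all lowercase.
data.rename(columns=lambda x: x.replace(' ', '_').lower(), inplace=True)

# DataFrame.replace swaps one value for another across the whole frame.
# 'unknown' and 'not_recorded' are illustrative values, not from the course data.
data = data.replace({'unknown': 'not_recorded'})

# Write back to CSV; an in-memory buffer is used here instead of a file path.
out = io.StringIO()
data.to_csv(out, index=False)
csv_text = out.getvalue()
print(csv_text)
```

Passing a file path to `to_csv` instead of the buffer would write the result to disk.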
- Execute your SQL command against the pandas frame
- aadhaar_solution = pandasql.sqldf(q.lower(), locals())
- Query:
- q = """ SELECT registrar, enrolment_agency FROM aadhaar_data LIMIT 50; """
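To try the query without pandasql installed, the same SQL can be run against an in-memory SQLite database using the standard-library sqlite3 module. This is a stand-in, not the course's method: the table is built by hand and its 60 rows are made-up placeholder values.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE aadhaar_data (registrar TEXT, enrolment_agency TEXT)"
)

# Placeholder rows, invented for illustration only.
rows = [("Registrar %d" % i, "Agency %d" % i) for i in range(60)]
conn.executemany("INSERT INTO aadhaar_data VALUES (?, ?)", rows)

# The query from the notes: two columns, capped at 50 rows.
q = """
SELECT registrar, enrolment_agency
FROM aadhaar_data
LIMIT 50;
"""

result = conn.execute(q).fetchall()
conn.close()

print(len(result))
```

With pandasql, the table name in the query refers to a DataFrame variable in scope, which is why the note passes locals() to sqldf.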
- district: an administrative area or region (vocabulary note)
- Last.fm REST API (docs page: REST Requests, Last.fm)
- import json
- import requests
- r = requests.get('http://ws.audioscrobbler.com/2.0/?method=geo.gettopartists&country=spain&api_key=ddea7229e302ed03addcb3dabc01954e&format=json')
- artist = json.loads(r.text)[u'topartists'][u'artist'][0][u'name']
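The nested lookup in the last line can be tried offline with a hand-written payload. The JSON below is made up and only mimics the nesting used in the note (topartists, then artist, then a list of objects with a name); a real Last.fm response has more fields and the helper name `top_artist_name` is hypothetical.

```python
import json

# Made-up response text; only the nesting matches what the note relies on.
sample_text = """
{"topartists": {"artist": [
    {"name": "Artist A", "listeners": "100"},
    {"name": "Artist B", "listeners": "90"}
]}}
"""


def top_artist_name(text):
    """Return the name of the first artist in a geo.gettopartists-style payload."""
    payload = json.loads(text)
    return payload["topartists"]["artist"][0]["name"]


artist = top_artist_name(sample_text)
print(artist)
```

With a live response, `top_artist_name(r.text)` would do the same lookup; `r.json()` in requests should give the same dictionary as `json.loads(r.text)`.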
- Missing values can come from occasional system errors
- Door-to-door surveys
- Dealing with missing data
- Partial deletion
- Listwise deletion
- Pairwise deletion
- Imputation (statistics): a method that fills in missing values with plausible data
- Fill NA with a specified value (for example, the mean)
- fillna(value)
- baseball['weight'] = baseball['weight'].fillna(numpy.mean(baseball['weight']))
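A small sketch contrasting the approaches listed above on a toy table. The heights and weights are invented, and reading pairwise deletion as "each statistic uses every row available for it" is my assumption about how the term maps onto pandas.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value per column.
baseball = pd.DataFrame({
    "height": [72.0, np.nan, 70.0, 74.0],
    "weight": [180.0, 200.0, np.nan, 220.0],
})

# Listwise deletion: drop every row that has any missing value.
listwise = baseball.dropna()

# Pairwise deletion (as assumed here): each column mean skips only its own
# missing entries, which is the default behaviour of DataFrame.mean.
pairwise_means = baseball.mean()

# Imputation: fill missing weights with the mean of the observed weights.
imputed = baseball.copy()
imputed["weight"] = imputed["weight"].fillna(imputed["weight"].mean())

print(listwise)
print(pairwise_means)
print(imputed)
```

Listwise deletion keeps only two of the four rows here, while mean imputation keeps all four but pulls the filled value toward the center of the column.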
Took longer than expected (1.5 h + 0.5 h)
As a site, the way it explains things feels close to CodeSchool