Intro to Data Science: Lesson 5
Intro to Data Science Online Course - Udacity
Uses
- mappers
- reducers
- Basic MapReduce
- Word count
- debug log
- import logging
- logging.info("My debugging message")
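A minimal Python 3 sketch of the debug-log note above, assuming the script runs outside the course grader. In a mapper or reducer, stdout carries the results, so debug output is safer on stderr. With the default configuration the root logger only shows WARNING and above, so logging.info prints nothing until a level is set. The helper name make_debug_logger and the logger name are hypothetical, not part of the course material.

```python
import logging
import sys


def make_debug_logger(stream=None):
    """Return a logger that writes INFO messages to `stream` (stderr by default)."""
    logger = logging.getLogger("mapreduce_debug")  # hypothetical logger name
    logger.setLevel(logging.INFO)  # the default level would hide info() calls
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep messages out of the root logger
    return logger


if __name__ == "__main__":
    log = make_debug_logger()
    log.info("My debugging message")  # written to stderr, not stdout
```

Keeping the log on stderr means the tab-separated output that the reducer reads from stdout stays clean.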
word_counts
# Python 2
import sys
import string

word_counts = {}
for line in sys.stdin:
    data = line.strip().split(" ")
    for word in data:
        # Remove punctuation, then lowercase
        result = word.translate(string.maketrans("", ""), string.punctuation).lower()
        #logging.info(result)
        if not result:
            continue
        if result in word_counts:
            word_counts[result] += 1
        else:
            word_counts[result] = 1
#logging.info(word_counts)
print word_counts
- Checking whether a dictionary contains a key
- if result in word_counts:
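- [NG] if result in word_counts.keys():  (in Python 2, keys() builds a list, so the lookup is a linear scan)
- [???] if result in word_counts.key():  (dict has no key() method, so this raises AttributeError)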
- dictionary - counting duplicate words in python the fastest way - Stack Overflow
- Punctuation marks
- string.punctuation
- Skipping to the next iteration of a for loop -> continue
- Checking for an empty string
- if not word:
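The notes above can be collected into one Python 3 sketch (the word count code earlier in these notes is Python 2). It is an illustration rather than the course's solution: count_words is a hypothetical helper name, and collections.Counter is swapped in for the hand-written dictionary update.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def count_words(lines):
    """Count lowercase, punctuation-free words across an iterable of lines."""
    word_counts = Counter()
    for line in lines:
        for word in line.strip().split(" "):
            result = word.translate(_STRIP_PUNCTUATION).lower()
            if not result:  # empty after stripping, e.g. "--" or a double space
                continue  # move on to the next word
            word_counts[result] += 1  # Counter treats missing keys as 0
    return word_counts
```

Passing sys.stdin as lines gives the same behaviour as the script form, and "word in counts" remains the direct way to test for a key.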
MapReduce is useful for very large datasets, such as the text of a huge collection of books.
mapper, reducer
Aadhaar: India's biometric national ID project (NTT Data / DIGITAL GOVERNMENT & FINANCIAL TOPICS) - IAIS | Institute of Administrative Information Systems
mapper
# Python 2
import sys

for line in sys.stdin:
    data = line.strip().split(",")
    # Skip malformed rows and the header row
    if len(data) != 12 or line.startswith("Registrar"):
        continue
    print "{0}\t{1}".format(data[3], data[8])
reducer
# Python 2
import sys

word_counts = {}
for line in sys.stdin:
    data = line.strip().split("\t")
    # Start each new key at zero, then add this line's value
    if data[0] not in word_counts:
        word_counts[data[0]] = 0.0
    word_counts[data[0]] += float(data[1])
for key in word_counts:
    print "{0}\t{1}".format(key, word_counts[key])
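To check a mapper and reducer pair like the one above without a cluster, the map, sort and reduce phases can be simulated in one process. This is a Python 3 sketch under assumptions: the function names are hypothetical, the sample rows and column names are made up for illustration and are not real Aadhaar data, and the column positions (index 3 as key, index 8 as value, 12 columns, header starting with "Registrar") are taken from the mapper above.

```python
def mapper(lines):
    """Yield (key, value) pairs from comma-separated rows, skipping bad rows."""
    for line in lines:
        data = line.strip().split(",")
        if len(data) != 12 or line.startswith("Registrar"):
            continue
        yield data[3], data[8]


def reducer(pairs):
    """Sum the values for each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return reducer(sorted(mapper(lines)))


# Made-up rows for illustration only (12 comma-separated columns each).
SAMPLE = [
    "Registrar,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11",  # header, skipped
    "r,a,s,Alpha,x,x,x,x,3,x,x,x",
    "r,a,s,Beta,x,x,x,x,5,x,x,x",
    "r,a,s,Alpha,x,x,x,x,4,x,x,x",
    "too,short",  # wrong column count, skipped
]

if __name__ == "__main__":
    for key, total in sorted(run_local(SAMPLE).items()):
        print("{0}\t{1}".format(key, total))
```

The sorted() call stands in for the shuffle step that groups identical keys between the two phases.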
Marked incorrect.
The model answer does not work either.
BUS 597 Directed Study - Introduction to Analytics: Lesson 5 - MapReduce
# Python 2
import sys

aadhaar_generated = 0
old_key = None
for line in sys.stdin:
    data = line.strip().split("\t")
    this_key, count = data
    aadhaar_generated += float(count)
    if old_key and old_key != this_key:
        print "{0}\t{1}".format(data[0], aadhaar_generated)
        aadhaar_generated = 0
    old_key = this_key
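One possible reason a reducer written as above misbehaves is the order of operations: the count is added before the key change is checked, the new key is printed instead of the old one, and the final key is never emitted. The following Python 3 sketch shows the usual streaming pattern, assuming the input arrives sorted by key. The name reduce_sorted is hypothetical, and this is not the course's official answer.

```python
def reduce_sorted(lines):
    """Yield (key, total) for tab-separated 'key<TAB>count' lines sorted by key."""
    total = 0.0
    old_key = None
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:  # skip malformed lines
            continue
        this_key, count = data
        # A new key means the previous key's total is complete: emit it first.
        if old_key is not None and old_key != this_key:
            yield old_key, total
            total = 0.0
        old_key = this_key
        total += float(count)
    # Emit the last key, which never sees a "key changed" event.
    if old_key is not None:
        yield old_key, total
```

Unlike the dictionary-based reducer, this version holds only one running total in memory, which is why it depends on sorted input.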
- Hadoop
- Hive
- Pig
- Pig Latin
- Yahoo
- Mahout
- Cassandra
Sometimes it is better not to insist on completing everything.
1.5h