Intro to Data Science : Lesson 5

Intro to Data Science Online Course - Udacity

Uses
- Chevron oil fields
- eBay e-commerce
- Malware
- Medical question data
mappers
reducers
Basic MapReduce
- Word count
debug log
- import logging
- logging.info("My debugging message")
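A minimal Python 3 sketch of the debug-log idea (the notes themselves use Python 2). In a streaming mapper or reducer, stdout carries the data, so debug messages are better kept off it; `logging` handlers write to stderr by default. The function name `make_debug_logger` and its optional `stream` parameter are hypothetical, added only for illustration.

```python
import logging
import sys


def make_debug_logger(name="mapreduce_debug", stream=None):
    """Return a logger that writes INFO messages to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers when called twice
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep messages out of the root logger
    return logger


if __name__ == "__main__":
    log = make_debug_logger()
    log.info("My debugging message")  # goes to stderr, not stdout
```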

word_counts

    import string
    import sys

    word_counts = {}  # word -> number of occurrences

    for line in sys.stdin:
        data = line.strip().split(" ")

        for word in data:
            # Python 2: delete punctuation, then lower-case the word
            result = word.translate(string.maketrans("", ""), string.punctuation).lower()
            #logging.info(result)
            if not result:
                continue  # nothing left after stripping punctuation

            if result in word_counts:
                word_counts[result] += 1
            else:
                word_counts[result] = 1

    #logging.info(word_counts)
    print word_counts
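
The same word count can be split into a map step and a reduce step, which is the shape MapReduce expects. The following is a Python 3 sketch (the listing above is Python 2); `map_words`, `reduce_counts`, and the in-memory `sorted` call that stands in for the shuffle phase are illustrative assumptions, not the course's code.

```python
import string
from itertools import groupby
from operator import itemgetter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def map_words(lines):
    """Map step: emit a (word, 1) pair for every non-empty cleaned word."""
    for line in lines:
        for raw in line.strip().split(" "):
            word = raw.translate(_STRIP_PUNCT).lower()
            if word:
                yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    text = ["Hello, world!", "hello again"]
    pairs = sorted(map_words(text))  # stands in for the shuffle/sort phase
    for word, total in reduce_counts(pairs):
        print("{0}\t{1}".format(word, total))
```

`groupby` only merges adjacent equal keys, which is why the pairs are sorted before the reduce step.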

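Checking whether a dictionary has a key
- if result in word_counts:
- [NG] if result in word_counts.keys(): (in Python 2 this builds a list of keys first, so it is slower)
- [NG] if result in word_counts.key(): (dict has no key() method; this raises AttributeError)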
- dictionary - counting duplicate words in python the fastest way - Stack Overflow
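
A small Python 3 sketch of the three spellings. In Python 3, `keys()` returns a view rather than a list, so the second form is no longer slow, though `in word_counts` remains the idiomatic one. The sample dictionary is made up for illustration.

```python
word_counts = {"hello": 2, "world": 1}

# Idiomatic: hash lookup directly on the dictionary.
assert "hello" in word_counts

# Works, but redundant (and slow in Python 2, where keys() returns a list).
assert "hello" in word_counts.keys()

# dict has no key() method, so this spelling raises AttributeError.
try:
    "hello" in word_counts.key()
except AttributeError as exc:
    print("AttributeError:", exc)

# dict.get avoids the explicit membership test when counting.
word_counts["again"] = word_counts.get("again", 0) + 1
print(word_counts["again"])
```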
Punctuation characters
- string.punctuation
Moving on to the next for-loop iteration -> continue
Checking for an empty string
- if not word:
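
A short Python 3 sketch tying these notes together: strip `string.punctuation`, then use `continue` to skip tokens that end up empty. The helper name `clean_tokens` is hypothetical. `str.maketrans("", "", ...)` is the Python 3 form; in Python 2 the equivalent call is `word.translate(string.maketrans("", ""), string.punctuation)`.

```python
import string


def clean_tokens(line):
    """Return lower-cased tokens of `line` with ASCII punctuation removed."""
    table = str.maketrans("", "", string.punctuation)
    tokens = []
    for word in line.strip().split(" "):
        word = word.translate(table).lower()
        if not word:  # an empty string is falsy
            continue  # skip to the next token
        tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(clean_tokens("Wait -- what?  Really!"))
```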

MapReduce is useful for enormous data sets, such as the text of a huge collection of books.

mapper, reducer

Aadhaar: India's biometric national ID project (NTT Data / DIGITAL GOVERNMENT & FINANCIAL TOPICS) - IAIS | Institute of Administrative Information Systems

mapper

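    import sys

    for line in sys.stdin:
        data = line.strip().split(",")
        # skip malformed rows and the header row
        if len(data) != 12 or line.startswith("Registrar"):
            continue

        # emit column 3 as the key and column 8 as the value, tab-separated
        print "{0}\t{1}".format(data[3], data[8])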
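
A Python 3 sketch of the same mapper written as a function over any iterable of lines, which makes it easy to try without piping a file through stdin. The function name `aadhaar_mapper` and the sample rows are made up; only the 12-column check, the `Registrar` header test, and the use of columns 3 and 8 come from the code above.

```python
def aadhaar_mapper(lines):
    """Yield 'key<TAB>value' strings built from columns 3 and 8 of each CSV row."""
    for line in lines:
        data = line.strip().split(",")
        # Skip malformed rows and the header row.
        if len(data) != 12 or line.startswith("Registrar"):
            continue
        yield "{0}\t{1}".format(data[3], data[8])


if __name__ == "__main__":
    # Hypothetical rows: the header, one valid 12-field row, one malformed row.
    sample = [
        "Registrar,b,c,d,e,f,g,h,i,j,k,l",
        "r1,a1,s1,DistrictA,x,y,z,w,5,0,0,0",
        "too,short",
    ]
    for record in aadhaar_mapper(sample):
        print(record)
```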

reducer

    import sys

    word_counts = {}  # key -> running total

    for line in sys.stdin:
        data = line.strip().split("\t")
        if data[0] not in word_counts:
            word_counts[data[0]] = 0.0
        word_counts[data[0]] += float(data[1])

    # emit one total per key
    for key in word_counts:
        print "{0}\t{1}".format(key, word_counts[key])

Marked incorrect

The model answer does not work either

BUS 597 Directed Study - Introduction to Analytics: Lesson 5 - MapReduce

    aadhaar_generated = 0
    old_key = None

    for line in sys.stdin:
         data= line.strip().split("\t")   
         this_key,count = data
         aadhaar_generated += float(count)

         if old_key and old_key != this_key:
            print "{0}\t{1}".format(data[0],aadhaar_generated)
            aadhaar_generated = 0

         old_key = this_key
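
The listing above adds each count before checking whether the key changed, prints the new key rather than the old one, and never emits the final key. A Python 3 sketch of a streaming reducer that avoids those three issues might look like the following. It assumes the input is already sorted by key, and `streaming_reducer` is a hypothetical name.

```python
def streaming_reducer(lines):
    """Yield 'key<TAB>total' for each run of equal keys in key-sorted input."""
    total = 0.0
    old_key = None
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # ignore malformed lines
        this_key, count = data
        # Key changed: emit the finished key before touching the new count.
        if old_key is not None and old_key != this_key:
            yield "{0}\t{1}".format(old_key, total)
            total = 0.0
        old_key = this_key
        total += float(count)
    # Emit the final key, which no later key change will trigger.
    if old_key is not None:
        yield "{0}\t{1}".format(old_key, total)


if __name__ == "__main__":
    for record in streaming_reducer(["A\t1\n", "A\t2\n", "B\t3\n"]):
        print(record)
```

Unlike the dictionary version, this keeps only one running total in memory, at the cost of relying on sorted input.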

HADOOP
- Hive
- Pig
  - Pig Latin
  - Yahoo
Mahout
Cassandra

Sometimes it is better not to insist on completing everything

1.5h

quattro_4 scribble

scribble (just scribbling down the things I looked up)
