quattro_4 scribble

scribble: doodles (just jotting down things I looked up)

Intro to Data Science : Lesson 5

Intro to Data Science Online Course - Udacity

MapReduce

word_counts

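    import string
    import sys

    # Translation table that deletes all ASCII punctuation (Python 3 form).
    strip_punct = str.maketrans("", "", string.punctuation)
    word_counts = {}

    for line in sys.stdin:
        data = line.strip().split(" ")

        for word in data:
            # Normalize each word: drop punctuation, then lowercase.
            result = word.translate(strip_punct).lower()
            # logging.info(result)
            if not result:
                continue

            if result in word_counts:
                word_counts[result] += 1
            else:
                word_counts[result] = 1

    # logging.info(word_counts)
    print(word_counts)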

MapReduce pays off on very large inputs, such as the text of a huge collection of books.

mapper, reducer
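
The word-count loop above can be split into the two MapReduce phases. The following is a minimal sketch in Python 3, not code from the course: `word_mapper`, `word_reducer`, and `word_count` are hypothetical names, and `sorted()` stands in for the shuffle/sort step that a real framework performs between the phases.

```python
import string
from itertools import groupby

# Translation table that deletes all ASCII punctuation.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def word_mapper(lines):
    """Map phase: emit one (word, 1) pair per cleaned, non-empty word."""
    for line in lines:
        for word in line.strip().split(" "):
            cleaned = word.translate(_STRIP_PUNCT).lower()
            if cleaned:
                yield cleaned, 1


def word_reducer(sorted_pairs):
    """Reduce phase: sum the counts of consecutive pairs sharing a key."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort, and reduce in one process (hypothetical helper)."""
    # sorted() imitates the shuffle/sort step between the two phases.
    return dict(word_reducer(sorted(word_mapper(lines))))
```

For example, `word_count(["Hello, world!", "hello again"])` counts `hello` twice and `world` and `again` once each.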

Aadhaar: The biometric national ID project in India (NTT Data / DIGITAL GOVERNMENT & FINANCIAL TOPICS) - IAIS | Institute of Administrative Information Systems

mapper

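    import sys

    for line in sys.stdin:
        data = line.strip().split(",")
        # Skip the header row and any row without exactly 12 fields.
        if len(data) != 12 or line.startswith("Registrar"):
            continue

        # Emit the key (field 3) and the value (field 8), tab-separated.
        print("{0}\t{1}".format(data[3], data[8]))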
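
To check the filtering rule without piping a file through stdin, the same logic can be wrapped in a function. This is only a sketch: `map_line` is a hypothetical name and the sample rows are made up for illustration. Only the 12-field check, the `Registrar` header test, and the use of fields 3 and 8 come from the mapper above.

```python
def map_line(line):
    """Return a tab-separated key/value string, or None to skip the row."""
    data = line.strip().split(",")
    # Same rule as the mapper: drop the header and malformed rows.
    if len(data) != 12 or line.startswith("Registrar"):
        return None
    return "{0}\t{1}".format(data[3], data[8])


# Made-up rows: 12 comma-separated fields, key in field 3, value in field 8.
sample_rows = [
    "Registrar,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11\n",  # header, skipped
    "r,a,b,North,c,d,e,f,2,g,h,i\n",                   # valid row
    "too,short\n",                                     # wrong field count
]

mapped = [out for out in map(map_line, sample_rows) if out is not None]
```

With these rows, only the second one survives the filter, so `mapped` holds a single `North` entry.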

reducer

    import sys

    word_counts = {}

    for line in sys.stdin:
        data = line.strip().split("\t")
        # Start each new key at zero, then accumulate its values.
        if data[0] not in word_counts:
            word_counts[data[0]] = 0.0
        word_counts[data[0]] += float(data[1])

    for key in word_counts:
        print("{0}\t{1}".format(key, word_counts[key]))

Marked incorrect.

The model answer does not work either.

BUS 597 Directed Study - Introduction to Analytics: Lesson 5 - MapReduce

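    import sys

    aadhaar_generated = 0
    old_key = None

    for line in sys.stdin:
        data = line.strip().split("\t")
        this_key, count = data
        # Note: the count is added before the key comparison, so the first
        # row of a new key is included in the previous key's total.
        aadhaar_generated += float(count)

        if old_key and old_key != this_key:
            # Note: data[0] is the new key here, not old_key.
            print("{0}\t{1}".format(data[0], aadhaar_generated))
            aadhaar_generated = 0

        old_key = this_key
    # Note: the total for the last key is never printed.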
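
For comparison, a streaming reducer of this shape could be sketched as follows. This is not the course's reference solution and has not been run against its grader. It assumes the input is already sorted by key, compares keys before adding the current value, emits each total under the previous key, and flushes the last key after the loop. `streaming_reduce` is a hypothetical name.

```python
def streaming_reduce(lines):
    """Yield one tab-separated key/total string per key from sorted input."""
    old_key = None
    total = 0.0
    for line in lines:
        data = line.strip().split("\t")
        if len(data) != 2:
            continue  # skip malformed lines
        this_key, count = data
        if old_key is not None and old_key != this_key:
            # Key changed: emit the finished total under the OLD key.
            yield "{0}\t{1}".format(old_key, total)
            total = 0.0
        old_key = this_key
        total += float(count)
    if old_key is not None:
        # Flush the last key, which no later key change would emit.
        yield "{0}\t{1}".format(old_key, total)
```

As a script, this would be driven by looping over `streaming_reduce(sys.stdin)` and printing each yielded line. Unlike the dictionary-based reducer, it holds only one running total in memory at a time.
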
  • Hadoop
    • Hive
    • Pig
      • Pig Latin
      • Yahoo
  • Mahout
  • Cassandra

Sometimes it is fine not to insist on completing everything.

1.5h