Someone had asked for the Skype transcript from the Hadoop Lab; sorry I didn't get around to posting until now. Might still be useful for HW5.
Chat History with Hadoop Lab - Nicolas Pinto (#cs-264/$b465d41d756d96aa)
Created on 2009-10-28 13:21:51.
2009-10-21
cs-264: 13:12:20
did you guys hear the explanation of the homework schedule?
Lateral Punk: 13:12:27
yes
cs-264: 13:13:48
curl cs264.org/files/gutenberg.tar.gz | tar xzv
Lateral Punk: 13:13:54
thanks
cs-264: 13:13:55
to fetch data into /home/training
cs-264: 13:14:14
(contains 3 text files)
cs-264: 13:14:24
this is on your VM
cs-264: 13:14:31
all set?
Lateral Punk: 13:14:35
yes
Derek Schwenke: 13:14:55
I missed the hw schedule, sorry. Is there audio or video today?
Dave Sousa: 13:14:56
good thanks
cs-264: 13:15:02
there should be live video
cs-264: 13:15:39
key point is that homework five will come out later this week, you'll have two weeks to work on it, and it will be worth two assignments
cs-264: 13:16:11
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html for more on streaming (URL from slide)
Derek Schwenke: 13:16:20
Got the audio video feed... So this is archived and I can see it again to get the HW schedule
cs-264: 13:16:32
are you guys able to read the slides on the screen, or do I need to retype commands
Lateral Punk: 13:16:40
please type it
Lateral Punk: 13:16:41
can't see it
Lateral Punk: 13:16:50
sorry man

cs-264: 13:17:05
slides going up shortly

cs-264: 13:17:47
export SJAR=/usr/lib/hadoop/contrib/streaming/hadoop-0.20.1+133-streaming.jar
cs-264: 13:17:57
(use tab complete if you have a different version of the jarfile)
cs-264: 13:18:24
hadoop dfs -copyFromLocal gutenberg gutenberg
cs-264: 13:18:45
should see this output:
cs-264: 13:18:47
training@training-vm:~$ hadoop dfs -ls
Found 1 items
drwxr-xr-x - training supergroup 0 2009-10-21 10:18 /user/training/gutenberg
training@training-vm:~$ hadoop dfs -ls gutenberg
Found 3 items
-rw-r--r-- 1 training supergroup 674762 2009-10-21 10:18 /user/training/gutenberg/20417.txt
-rw-r--r-- 1 training supergroup 1573044 2009-10-21 10:18 /user/training/gutenberg/4300.txt
-rw-r--r-- 1 training supergroup 1395667 2009-10-21 10:18 /user/training/gutenberg/8ldvc10.txt
training@training-vm:~$
cs-264: 13:19:23
hmm... should figure out how to send my terminal to skype

cs-264: 13:19:27
questions?
Lateral Punk: 13:20:24
i get this
Lateral Punk: 13:20:25
training@training-vm:~$ hadoop dfs -ls
Found 1 items
drwxr-xr-x - training supergroup 0 2009-10-21 10:19 /user/training/gutenberg
cs-264: 13:20:39
yup. that's right, I ran two commands
Lateral Punk: 13:20:47
k
Lateral Punk: 13:22:25
can't see it
Lateral Punk: 13:22:44
yes keep it there
cs-264: 13:22:52
hadoop jar $SJAR -mapper cat -reducer "wc \-w" -input gutenberg -output gutenberg-cat-wc
cs-264: 13:23:14
https://docs.google.com/present/edit?id=0ARaSv-u0I2mFZGZuMjkzNXdfMzA2OGc0MjdtZ2I&hl=en
cs-264: 13:23:35
we're on slide 6
cs-264: 13:23:56
note I had to put a backslash so that hadoop wouldn't parse the -w as an option for it
cs-264: 13:25:25
sounds like backslash only matters if you have the newer hadoop version
Lateral Punk: 13:25:36
nope doesn't work
Lateral Punk: 13:26:02
training@training-vm:~$ hadoop jar $SJAR -mapper cat -reducer 'wc \-w' -input gutenberg/ -output gutenberg-cat-wc
packageJobJar: [/var/lib/hadoop-0.20/cache/training/hadoop-unjar37105/] [] /tmp/streamjob37106.jar tmpDir=null
09/10/21 10:25:53 INFO mapred.FileInputFormat: Total input paths to process : 4
09/10/21 10:25:53 ERROR streaming.StreamJob: Error Launching job : Not a file: hdfs://localhost:8022/user/training/gutenberg/gutenberg
Streaming Command Failed!
Lateral Punk: 13:27:40
yes
cs-264: 13:28:41
so leave off the trailing forward slash on -input
cs-264: 13:28:51
causes some kind of nested ls issue
Lateral Punk: 13:28:54
nope that didn't do it for me
Lateral Punk: 13:28:57
let me see
Dave Sousa: 13:29:13
I left of the forward slash by accident and it seemed to work ok.
Dave Sousa: 13:29:28
oops ... should have said "left off"
cs-264: 13:29:58
yeah, I usually leave it off... I think it's cp -r that has really weird behavior if you're not careful with that

cs-264: 13:30:14
hadoop jar $SJAR -mapper cat -reducer "wc \-w" -input gutenberg -output gutenberg-cat-wc
cs-264: 13:30:19
does that not work for anyone still?
Lateral Punk: 13:31:11
how do u remove a directory/file from DFS?
Lateral Punk: 13:31:17
i think i get something extra
cs-264: 13:31:18
uhhhh
cs-264: 13:31:20
-rm?
cs-264: 13:31:29
hadoop dfs -rm
that is
Lateral Punk: 13:32:55
thanks
Dave Sousa: 13:33:07
should we have mapper.py?
Lateral Punk: 13:33:10
i had something extra in my directory
cs-264: 13:33:18
no, this is a DIY exercise
Dave Sousa: 13:33:23
got it
cs-264: 13:33:25
language of choice
Lateral Punk: 13:36:36
what are we to do at slide 7?
cs-264: 13:37:36
write mapper.py and reducer.py
cs-264: 13:37:53
mapper.py takes line(s), writes "<token> 1" for each token on each line
Lateral Punk: 13:37:54
can i do ruby?
cs-264: 13:37:57
sure
cs-264: 13:38:10
should be a 1-liner in any language with list comprehensions, I think

cs-264: 13:41:44
any questions on implementing the mapper or reducer?
Lateral Punk: 13:47:42
do u need the ( ) ??
Lateral Punk: 13:47:46
in the output entry
cs-264: 13:48:38
mapper output should look like:
training@training-vm:~$ echo "foo foo quux labs foo bar quux" | ./mapper.pyfoo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
training@training-vm:~$
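A minimal sketch of a mapper that would produce the output format described above, assuming Python and plain whitespace tokenization (the lab left both the language and the implementation open). The function name `map_lines` is illustrative and not from the lab. Note that the lab's format separates token and count with a space; Hadoop Streaming's default key/value separator is a tab, so with this format the whole line is treated as the key.

```python
#!/usr/bin/env python
"""mapper.py sketch: emit "<token> 1" for every whitespace-separated token."""
import sys


def map_lines(lines):
    """Yield one "<token> 1" record per token, in input order.

    `map_lines` is a hypothetical helper name; tokenization is a plain
    whitespace split, with no lowercasing or punctuation stripping.
    """
    for line in lines:
        for token in line.split():
            yield "%s 1" % token


if __name__ == "__main__":
    # Hadoop Streaming feeds input lines on stdin and reads records from stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

Saved as mapper.py and made executable (chmod +x mapper.py), it can be tried with the same echo pipeline shown above before running it under Hadoop.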
Lateral Punk: 13:48:44
k
cs-264: 13:48:57
er, with a newline in there properly after the command
Lateral Punk: 13:49:22
i don't understand the reducer
Lateral Punk: 13:49:56
i got it
cs-264: 13:50:45
reducer is probably easiest to just accumulate a dictionary
Lateral Punk: 13:50:54
ok
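The "accumulate a dictionary" suggestion could look roughly like the sketch below, assuming Python and the space-separated "<token> <count>" records from the mapper exercise. The function name `reduce_counts` is illustrative; splitting from the right on whitespace is a choice made here so that any trailing separator on the line is tolerated.

```python
#!/usr/bin/env python
"""reducer.py sketch: sum the counts for each token using a dictionary."""
import sys


def reduce_counts(lines):
    """Return a dict mapping each token to the sum of its counts.

    `reduce_counts` is a hypothetical helper name. Each line is expected
    to look like "<token> <count>"; blank or one-field lines are skipped.
    """
    counts = {}
    for line in lines:
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        token, count = parts
        counts[token] = counts.get(token, 0) + int(count)
    return counts


if __name__ == "__main__":
    totals = reduce_counts(sys.stdin)
    # Sorting only makes the output order stable; it is not required.
    for token in sorted(totals):
        print("%s %d" % (token, totals[token]))
```

Because everything is held in one dictionary, this approach assumes the set of distinct tokens seen by a reducer fits in memory, which is reasonable for three text files.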
Lateral Punk: 13:58:00
3 mins
cs-264: 14:03:12
for reference, I had to run:
hadoop jar $SJAR -mapper ~/mapper.py -reducer ~/reducer.py -input gutenberg -output gutenberg-out
cs-264: 14:03:24
so that hadoop got an absolute path to my code
Derek Schwenke: 14:10:42
In your example "foo foo quux labs foo bar quux" The output should start with:
Derek Schwenke: 14:10:47
foo 1
Derek Schwenke: 14:10:49
foo 1
Derek Schwenke: 14:10:54
Right?
cs-264: 14:11:10
yeah, skype munged the one newline when I pasted
cs-264: 14:11:36
reducer output should look like:
training@training-vm:~$ echo "foo foo quux labs foo bar quux" | ./mapper.py | ./reducer.py
bar 1
foo 3
labs 1
quux 2
training@training-vm:~$
cs-264: 14:11:46
(if you sort, otherwise the ordering might differ)
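For checking the logic without a cluster, the map, sort, and reduce steps can be approximated in a single process. This is only a sketch under assumptions: the names `map_line` and `simulate` are made up for illustration, an in-memory sort stands in for Hadoop's shuffle, and grouping adjacent keys replaces the dictionary as an alternative way to sum counts once the pairs are sorted.

```python
from itertools import groupby


def map_line(line):
    """Map step: one (token, 1) pair per whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def simulate(lines):
    """Run map, sort, and reduce in memory; return sorted (token, total) pairs.

    Both function names are hypothetical. The sort is a stand-in for the
    shuffle phase, which delivers records to a reducer grouped by key.
    """
    pairs = [pair for line in lines for pair in map_line(line)]
    pairs.sort(key=lambda pair: pair[0])
    return [
        (token, sum(count for _, count in group))
        for token, group in groupby(pairs, key=lambda pair: pair[0])
    ]


if __name__ == "__main__":
    for token, total in simulate(["foo foo quux labs foo bar quux"]):
        print("%s %d" % (token, total))
```

Run on the example sentence, this prints the same four sorted lines as the reducer output shown above.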
cs-264: 14:28:15
any last questions?
cs-264: 14:28:36
were you able to get output when running on hadoop?
Lateral Punk: 14:28:49
i'm still a little behind
cs-264: 14:31:22
I guess head to the forums?