WebAlgos
From ResearchWiki
Computer Science on the Web
Contents |
Organizing Principle
Topics should cover things that a student is likely to encounter. For example, the problems of computer networking and Internet routing are fundamental, but they don't directly impinge on the user (except in the case of wireless hijacking, botnets, or DDOS attacks)
General class structure
- Java data mining toolkit
- Three days a week
- each week focuses on one aspect of the 'web life', exploring the underlying computational primitives that drive it.
- the idea is to have an assignment a week, where students explore some aspect of the topic under discussion.
- if the assignments are all programming, this may be a bit heavy (maybe program every other week?)
- the general heuristic i've seen is that in CS1 you write 50-150 line programs, in CS2 100-250, but usually over two weeks
- course projects? pick a topic you really like and run with it?
- need a platform for demoing solutions: a web page, maybe ?
- web page seems ideal (given the theme), but of course locks in a language (probably java, since they know it and it's okay-ish for web stuff, although you could take this as an excuse to teach a scripting language -- eg., ruby -- which has some support in the department, eg., Erik)
- (Suresh) I'm confused: why does a web page lock in a platform: on the contrary, you can do whatever you want in the backend, as long as the server can handle it
- (Hal) you're right... though if we provide any infrastructure code then this would lock it... i also kind of feel like the "use whatever language you want" model might be too much for sophomores to handle :).
- see joe's web software architecture course for hardcore how to program on the web
- (Suresh) It's a pity, because Joe's class is offered in the spring, but it would have been interesting to run these simultaneously, or even run WebAlgos as a followup to his class. That would take care of some of programming dev issues
- web page seems ideal (given the theme), but of course locks in a language (probably java, since they know it and it's okay-ish for web stuff, although you could take this as an excuse to teach a scripting language -- eg., ruby -- which has some support in the department, eg., Erik)
- target audience?
- first free elective is 4th semester (page 6)
- at that point, they've taken CS1+2, 3500, discrete and comp org (bits and bytes class), plus calc 1 and 2
- earlier than that (eg., 2nd semester) seems hard: no free electives, no cs2, no discrete
- is it worth effort to make it more broad? (i can imagine it could be fairly appealing outside the dept)
- (suresh) Olivia Sheng's group might be very interested. Maybe we should show this to her.
- (hal) agreed... once we work things out a bit more
Possible list of topics
Finding Stuff
Google and Pagerank
Discuss the underlying linear algebra, eigenvalues and random walk concepts.
- Primitives: graphs, probability, random walks
- Algorithms: power method, matrix multiplication (key idea is that viewing the problem as a random walk immediately suggests an algorithm that might not be otherwise obvious!)
- Readings: Posts I and II by Michael Nielsen explaining Pagerank
- Possible projects:
- Implement PageRank on UofU SoC subnet?
- Perhaps tie together with dumb search from project for text searching
- MapReduce in the cloud
Text Searching
basic concepts of IR, vector space model
- Primitives: boolean search, bag of words model, vector space similarity
- Algorithms: inverted index (allows for O(1) fetching rather than O(log n) or O(n))
- Readings: can probably get sth from the IR book
- Possible projects:
- Implement boolean search and vector space model (duh!)
Crawling
link traverals, caching, duplicate detection and farms
- Primitives: graphs, caching
- Algorithms: maybe focus on dup detection? there's lots of cool randomized algorithm stuff here...
- Readings: ?
- Possible projects:
- Crawl cs.utah.edu?
- Do duplicate detection on some data set (probably not cs.utah.edu...)
Buying stuff
Netflix and Recommendations
introduce data mining, collaborative filtering
- Primitives: data as a matrix
- Not sure: there are so many ways to look at this.... low rank factorization (eg., SVD or NNMF), classification, regression, non-ML approaches, etc...
Facebook and Social networks
How to analyze social network structure for fun and profit
- I wonder if we can get some source of social network data from yahoo/google/facebook
- One possible organizing question: can you find 'communities' on the web:
- modularity as a key measure of society in a graph
- the enron graph data and using anomalies in communication to detect unusual phenomena:
- Twitter API for social networking.
Amazon and Secure transactions
- Sneha runs a network security class: we could ask him if he has any useful projects with code that we can adapt
Ebay and Auctions
- We can have students break up into teams and run something like a fake wireless spectrum auction (much like the one that actually happened a few years ago)
- Papers by Larry Ausubel (expert on this topic)
Web Advertising
Google adwords
Storing and distributing stuff
MapReduce and Massive Data
The mapreduce principle, Hadoop, and the challenges of dealing with massive data. The Disco programming environment. Elastic MapReduce
Cloud Computing/GFS/S3
Efficient distributed storage for fast retrieval
ZIP and GZIP and 1-pixel cameras
Keeping file sizes down (compression, compressed sensing)
Bittorrent and P2P networking
Basic principles of distributed caching, Chord etc
Akamai and content distribution
how to build distributed hash tables
Yahoo Pipes/Mashups
XML/RSS and the problem of data exchange
Attacking stuff
Sentiment Analysis
As below.
Spam and Spam filters
Simple naive Bayes methods for spam filtering
- Primitives: bag of words, probability
- Algorithms: maximum likelihood estimation (ok not really an algorithm :P)
- Readings: ?? lots of stuff
- Possible projects:
- Spam filter!
MP3s and DRM
Steganography (i.e digital watermarking)
Web SPAM
CAPTCHA and breaking CAPTCHAs with learning
Other Random Topics, Uncategorized
- Language on the web (eg., machine translation, information extraction)
- Wireless networks (information theory, coding)
- Turking (problem design? not sure?)
- I don't know if Turking involves any actual CS questions though:
- yeah, i can't think of anything either... though it could be tied with CAPTCHA (human computation) stuff...
- Crawling?
- Crawling is a good idea: it forces you to understand things like 'what's a duplicate web page', 'what's linkfarming', etc....