by Dhruba Borthakur
Recently, I visited a few premier educational institutes in India, e.g. the Indian Institute of Technology (IIT) at Delhi and Guwahati. Most of the undergraduate students at these two institutes are somewhat familiar with Hadoop and would like to work on Hadoop-related projects as part of their course work. One question these students commonly asked me is: what Hadoop feature can I work on?
Here are some topics I have in mind that would be good for students to attempt if they want to work on Hadoop.
Ability to make the Hadoop scheduler resource aware, especially of CPU, memory and IO resources. The current implementation is based on statically configured slots. (A sketch of this idea appears after this list.)
Ability to make a map-reduce job take new input splits even after the job has already started.
Ability to dynamically increase the number of replicas of data in HDFS based on access patterns. This is needed to handle hot spots of data. (See the replication sketch after this list.)
Ability to extend the map-reduce framework to process data that resides partly in memory. One assumption of the current implementation is that the map-reduce framework is used to scan data that resides on disk devices. But memory on commodity machines is becoming larger and larger: a cluster of 3000 machines with 64 GB each can keep about 200 TB of data in memory! It would be nice if the Hadoop framework could support caching the hot set of data in the RAM of the tasktracker machines. Performance should increase dramatically, because it is costly to serialize/compress data from disk into memory for every query. (See the caching sketch after this list.)
Heuristics to efficiently 'speculate' map-reduce tasks to help work around machines that are laggards. In the cloud, the biggest challenge for fault tolerance is not handling outright failures, but rather anomalies that make parts of the cloud slow without failing completely; these degrade the performance of jobs. (See the laggard-detection sketch after this list.)
Make map-reduce jobs work across data centers. In many cases, a single Hadoop cluster cannot fit into a single data center, and a user has to partition the dataset into two Hadoop clusters in two different data centers.
High availability of the JobTracker. In the current implementation, if the JobTracker machine dies, then all currently running jobs fail.
Ability to create snapshots in HDFS. The primary use of these snapshots is to retrieve a dataset that was erroneously modified or deleted by a buggy application.
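To make the scheduling item concrete, here is a minimal sketch of what resource-based task admission could look like, in contrast to counting fixed slots. All class and field names below are hypothetical illustrations, not existing Hadoop APIs.

```java
// Hypothetical sketch: admit a task based on the tracker's actually-free
// resources instead of a fixed slot count.
public class ResourceAwareAdmission {

  /** Resources a task declares it needs (hypothetical). */
  static class TaskRequest {
    int cpuMillicores;     // CPU share, in milli-cores
    long memoryBytes;      // estimated heap plus working set
    long diskBytesPerSec;  // expected IO bandwidth
  }

  /** Live measurements reported by a tasktracker (hypothetical). */
  static class TrackerStatus {
    int freeCpuMillicores;
    long freeMemoryBytes;
    long freeDiskBytesPerSec;
  }

  /**
   * Slot-based scheduling only asks "is a slot free?". This check instead
   * asks whether every resource dimension has headroom for the new task.
   */
  static boolean canAdmit(TaskRequest req, TrackerStatus tracker) {
    return req.cpuMillicores <= tracker.freeCpuMillicores
        && req.memoryBytes <= tracker.freeMemoryBytes
        && req.diskBytesPerSec <= tracker.freeDiskBytesPerSec;
  }
}
```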
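For the dynamic-replication idea, HDFS already exposes FileSystem.setReplication(), which a monitoring daemon could drive. The sketch below assumes some access monitor supplies per-file read counts; that monitor, the threshold, and the replica cap are made up for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raise the replication factor of files that an (assumed) access monitor
// reports as hot. FileSystem.setReplication() is a real HDFS call; the
// read-count source and the constants are hypothetical.
public class HotFileReplicator {
  private static final long HOT_READS_PER_HOUR = 10000; // assumed threshold
  private static final short MAX_REPLICAS = 10;         // assumed cap

  public static void maybeBoost(FileSystem fs, Path file, long readsLastHour)
      throws IOException {
    if (readsLastHour < HOT_READS_PER_HOUR) {
      return; // not a hot spot
    }
    short current = fs.getFileStatus(file).getReplication();
    if (current < MAX_REPLICAS) {
      // Schedules extra block copies on other datanodes in the background.
      fs.setReplication(file, (short) (current + 1));
    }
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    maybeBoost(fs, new Path(args[0]), Long.parseLong(args[1]));
  }
}
```

A real implementation would also need the reverse path: lowering the replication factor again once a file cools down.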
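For the in-memory processing idea, one way to think about it is a cache of already-deserialized records for hot blocks held in tasktracker RAM. The class below is a hypothetical sketch, not an existing Hadoop API, and it ignores eviction and memory accounting, which a real design would need.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Keep the deserialized records of hot HDFS blocks in RAM so repeated
// scans skip the read/decompress/deserialize path entirely.
public class BlockRecordCache<R> {
  /** The expensive disk path: read a block and deserialize its records. */
  public interface BlockLoader<T> {
    List<T> readAndDeserialize(long blockId);
  }

  // blockId -> already-deserialized records for that block
  private final Map<Long, List<R>> cache = new ConcurrentHashMap<>();

  /** Serve records from RAM when present; pay the disk cost only once. */
  public List<R> records(long blockId, BlockLoader<R> loader) {
    return cache.computeIfAbsent(blockId, loader::readAndDeserialize);
  }
}
```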
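For the speculation idea, one possible heuristic is to compare each task's progress rate against the average of its siblings and re-run outliers on another machine. The inputs (progress fraction, elapsed time) exist in the map-reduce framework; this particular rule and its constant are assumptions for illustration, not Hadoop's actual speculation policy.

```java
import java.util.ArrayList;
import java.util.List;

// Flag tasks whose progress rate is well below the mean of their siblings.
public class LaggardHeuristic {
  private static final double SLOW_FACTOR = 0.5; // assumed: < 50% of mean rate

  /**
   * @param progress  per-task completed fraction in [0,1]
   * @param elapsedMs per-task wall-clock running time
   * @return indices of tasks worth speculating on another machine
   */
  public static List<Integer> pickLaggards(double[] progress, long[] elapsedMs) {
    double meanRate = 0;
    double[] rate = new double[progress.length];
    for (int i = 0; i < progress.length; i++) {
      rate[i] = progress[i] / Math.max(1, elapsedMs[i]); // progress per ms
      meanRate += rate[i];
    }
    meanRate /= progress.length;

    List<Integer> laggards = new ArrayList<>();
    for (int i = 0; i < rate.length; i++) {
      if (rate[i] < SLOW_FACTOR * meanRate) {
        laggards.add(i);
      }
    }
    return laggards;
  }
}
```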
The first thing for a student who wants to do any of these projects is to download the source code of the HDFS and MAPREDUCE subprojects. Then create an account in the Apache JIRA bug tracking system. Search for an existing JIRA that describes your project; if none exists, then create a new JIRA. Then write a design document proposal so that the greater Apache Hadoop community can deliberate on the proposal, and post this document to the relevant JIRA.