  • People | AMPLab – UC Berkeley
    Graduate students: Hoyt, Chi Jin, Anurag Khandelwal, Kevin Klues, Sanjay Krishnan, Gautam Kumar, Nick Lanham, Haoyuan Li, Horia Mania, Henry Milner, Teodor Moldovan, Philipp Moritz, Robert Nishihara, Frank Austin Nothaft, Kay Ousterhout, Anand Padmanabha Iyer, Xinghao Pan, Aurojit Panda, Gene Pang, Aishwarya Parasuram, Qifan Pu, Maxim Rabinovich, Charles Reiss, Rebecca Roelofs, Geoffrey Schiebinger, Kalyanaraman Shankari, Virginia Smith, Evan Sparks, Liwen Sun, Jonathan Terhorst, Stephen Tu, Shivaram Venkataraman, Rashmi Vinayak, Andre Wibisono, Ashia Wilson, Neeraja Yadwadkar, Yuchen Zhang, David Yu Zhu. Undergraduates: Nicholas Lewis Altieri, Christopher Canel, Manu Goyal, Ibrahim Hamisu, Ananth Pallaseni, Vikram Sreekanti, Eric Tu, Zongheng Yang. Staff: Jeff Andersen Lee, Kattt Atchley, Carlyn Chinen, Timothy Danford, Gabriel Fierro, Jonahluis Galvez, Albert Goto, Tamille Johnson, Tomer Kaftan, Shane Knapp, Jey Kottalam, Jon Kuroda, Christian Legg, Matt Massie, Juan Sanchez, Boban Zarkovich. Alumni: Sameer Agarwal, Gautam Altekar, Ganesh Anantharayanan, Michael Armbrust, Alexandre Bayen, Rifat Berk, Oren Blasberg, Denny Britz, Tamara Broderick, Yanpei Chen, Allen Chen, Jessica Chen, Betty Beidi Chen, Tathagata Das, Aaron Davidson, John Duchi, Cliff Engle, Siamak Faridani, Alan Fekete

    Original URL path: https://amplab.cs.berkeley.edu/people/ (2015-05-19)

  • Software | AMPLab – UC Berkeley
    [BDAS software stack diagram: GraphX, MLBase, Velox, BlinkDB, MLPipelines, SparkSQL, MLlib, Spark Core, Succinct, HDFS, S3, Ceph, Tachyon, Mesos, Hadoop YARN; legend: AMPLab-developed, Spark community, 3rd party, in development.] Roadmap: BDAS will continue to evolve over the life of the AMPLab project, as existing components evolve and mature and new ones are added. Community: software project meetups. Help organize monthly developer meetups around BDAS components. Check out the Spark/Shark meetup group, the Mesos

    Original URL path: https://amplab.cs.berkeley.edu/software/ (2015-05-19)

  • AMP BLAB: The AMPLab Blog | AMPLab – UC Berkeley | Algorithms, Machines and People Lab
    Unlike some of the junkers on the track, we've spent a lot of time making the car fast, reliable, and competitive. That being said, over the course of a race (usually 16 hours split over two days) things that one wouldn't expect to fail do, and sometimes in spectacular fashion. In our most recent race, Sears Pointless (March 20-21, 2015 at Sonoma Raceway), our transmission decided to spray its vital juices up through the gear-shift gate and cover the entire interior of the car, including the driver, with a coating of smelly transmission fluid. Our front brake rotors were extremely warped as well, and under heavy braking (the best kind) the steering wheel would jerk from side to side and almost be ripped from our grasp. Not to mention the severe fuel-cutoff issues when we were below half a tank of gas. Thankfully, we were able to hold it together and finish the race. We placed 11th out of 181 entries, nearly cracking the top 10.

    This is some seriously cool open source hardware. But what does this have to do with big data and open source? The four team members are all involved in high tech, consisting of Mozilla, Level 3, and Google alums. We are all about open source and love data. And we collect in-car telemetry data with an open source hardware and software product from Autosport Labs called Race Capture Pro. While we don't use Spark for data processing (go go Google Sheets), the data we collect is invaluable for helping us keep track of how the car and drivers perform during the race, as well as for post-race bragging rights over who turned the fastest average lap (sadly, it wasn't me). With this data we were able to analyze things like our average pit-stop time (5 minutes 30 seconds over 6 stops) and each driver's average lap time in traffic (Chris is the best); with an open track, all four of us were within 6 seconds of each other on average over the entire weekend when turning fast laps, which is kind of amazing. These metrics show that everyone except for Chris needs to improve their race-traffic management skills, and that we need to bring our pit-stop times down to 5 minutes or less to contend for a top-5 finish. Our lap times are consistent and competitive (proven with data), and we know that the drivers and car are capable of more. For a taste of how cool this data and product is, check out our statistics from the race. For some fun reading, here's a preview of the most recent race and then coverage of the results. And finally, here is some on-track action during my two-hour-long stint on the first day of racing. Enjoy the wonderful sound of a rotary engine spinning up to nearly 9000 rpm.

    When Data Cleaning Meets Crowdsourcing
    Posted on March 6, 2015 by jnwang

    The vision of AMPLab is to integrate Algorithms (Machine Learning), Machines (Cloud Computing), and People (Crowdsourcing) to make sense of Big Data. In the past several years, AMPLab has developed a variety of open source software components to fulfill this vision. For example, to integrate Algorithms and Machines, AMPLab is developing MLbase, a distributed machine-learning system that aims to provide a declarative way to specify machine-learning tasks. One area where we see great potential for adding People to the mix is data cleaning. Real-world data is often dirty: inconsistent, inaccurate, missing, etc. Data analysts can spend over half of their time cleaning data without doing any actual analysis; on the other hand, without data cleaning it is hard to obtain high-quality answers from dirty data. Crowdsourced data cleaning systems could help analysts clean data more efficiently and more cheaply, which would significantly reduce the cost of the entire data analysis pipeline.
    [Table 1: an example of dirty data with format errors, missing values, and duplicate values.]

    In this post, I will highlight two AMPLab research projects aimed in this direction: (1) CrowdER, a hybrid human-machine entity resolution system, and (2) SampleClean, a sample-and-clean framework for fast and accurate query processing on dirty data.

    Entity resolution (ER) in database systems is the task of finding different records that refer to the same entity. ER is particularly important when integrating data from multiple sources, where it is not uncommon for records that are not exactly identical to refer to the same real-world entity. For example, consider the dirty data shown in Table 1: records r1 and r3 have different text in the City Name field but refer to the same city. A simple method to find such duplicate records is to ask the crowd to check all possible pairs and decide whether each pair refers to the same entity or not. If a data set has n records, this human-only approach requires the crowd to examine O(n^2) pairs, which is infeasible for data sets of even moderate size. Therefore, in CrowdER we propose a hybrid human-machine approach. The intuition is that among the O(n^2) pairs of records, the vast majority will be very dissimilar; such pairs can be easily pruned using a machine-based technique, and the crowd can then be brought in to examine the remaining pairs. Of course, in practice there are many challenges that need to be addressed, for example: (i) how can we develop fast machine-based techniques for pruning dissimilar pairs, and (ii) how can we reduce the crowd cost required to examine the remaining pairs? For the first challenge, we devise efficient similarity-join algorithms which can prune the dissimilar pairs from among a trillion pairs within a few minutes. For the second challenge, we identify the importance of exploiting transitivity to reduce the crowd cost, and we present a new framework for implementing this technique. We evaluated CrowdER on several real-world datasets that are hard for machine-based ER techniques. Experimental results showed that CrowdER achieved more than 50% higher quality than these machine-based techniques, while being several orders of magnitude cheaper and faster than human-only approaches.
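    To make the hybrid approach concrete, below is a minimal Python sketch of the two steps: a machine pass that prunes pairs below a token-Jaccard similarity threshold, and a crowd pass that exploits transitivity (via union-find) to skip questions whose answers are already implied. The jaccard measure, the threshold, and the ask_crowd callback are illustrative stand-ins; the actual CrowdER system uses optimized similarity-join algorithms and batches pairs into crowd tasks.

```python
# Minimal sketch of a hybrid human-machine ER workflow (illustrative only).
from itertools import combinations

def jaccard(a, b):
    """Token-set Jaccard similarity between two string records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def hybrid_er(records, threshold, ask_crowd):
    """Machine step prunes dissimilar pairs; the crowd verifies the rest.
    Transitivity (via union-find) skips questions already implied."""
    n = len(records)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Machine step: keep only pairs at or above the similarity threshold.
    candidates = [(i, j) for i, j in combinations(range(n), 2)
                  if jaccard(records[i], records[j]) >= threshold]
    # Ask about high-similarity pairs first so transitivity pays off sooner.
    candidates.sort(key=lambda p: -jaccard(records[p[0]], records[p[1]]))

    for i, j in candidates:
        if find(i) == find(j):
            continue                        # already known duplicates: free
        if ask_crowd(records[i], records[j]):
            parent[find(i)] = find(j)       # union the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())

# Toy usage with a simulated "crowd" that answers from similarity alone:
# hybrid_er(["San Francisco CA", "san francisco, CA", "Berkeley CA"],
#           threshold=0.2, ask_crowd=lambda a, b: jaccard(a, b) > 0.4)
```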
    While crowdsourcing can make data cleaning more tractable, it is still highly inefficient for large datasets. To overcome this limitation, we started the SampleClean project, which aims to explore how to obtain accurate query results from dirty data by cleaning only a small sample of the data. The figure below illustrates why SampleClean can achieve this goal. In the figure, we compare the error in the query results returned by three query processing methods: AllDirty does not clean any data and simply runs a query over the entire original data set; AllClean first cleans the entire data set and then runs a query over the cleaned data; and SampleClean is our new query processing framework that requires cleaning only a sample of the data. We can see that SampleClean returns a more accurate query result than AllDirty by cleaning a relatively small subset of the data; this is because SampleClean can leverage the cleaned sample to reduce the impact of data error on its query result, while AllDirty has no such feature. We can also see that SampleClean is much faster than AllClean, since it only needs to clean a small sample of the data, whereas AllClean has to clean the entire data set. An initial version of the SampleClean system was demonstrated recently at AMP Camp 5 (slides, video). We envision that the SampleClean system can add data cleaning and crowdsourcing capabilities to BDAS (the Berkeley Data Analytics Stack) and enable BDAS to become a more powerful software stack for making sense of Big Data.
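    The estimator at the heart of this idea can be sketched in a few lines. The following is a hedged illustration for a simple mean query, assuming a numeric column and a user-supplied clean_fn; the function names are hypothetical, and the real system also reports confidence intervals that shrink as the sample grows. It shows two styles of estimate described in the SampleClean work: a direct estimate from the cleaned sample, and a corrected estimate that adjusts the cheap dirty aggregate by the error observed on the sample.

```python
# Hypothetical sketch: estimate the mean of a dirty numeric column while
# paying the (expensive) cleaning cost on only k sampled rows.
import random
import statistics

def sampleclean_mean(dirty, clean_fn, k, seed=0):
    random.seed(seed)
    idx = random.sample(range(len(dirty)), k)
    sample = [(dirty[i], clean_fn(dirty[i])) for i in idx]  # clean sample only

    # Direct estimate: mean of the cleaned sample.
    direct = statistics.mean(c for _, c in sample)

    # Corrected estimate: take the cheap aggregate over ALL dirty rows and
    # subtract the average error estimated from the sample (lower variance
    # when most rows are already clean).
    avg_error = statistics.mean(d - c for d, c in sample)
    corrected = statistics.mean(dirty) - avg_error
    return direct, corrected

# Toy usage: some rows were recorded with a spurious markup, and cleaning
# (here faked by a constant) recovers the true value of 100.0.
if __name__ == "__main__":
    data = [100.0, 110.0, 100.0, 100.0, 110.0, 100.0] * 100
    print(sampleclean_mean(data, lambda v: 100.0, k=60))
```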
    Crowdsourcing is a promising way to scale up the inefficient process of cleaning data in the data analysis pipeline, but it brings with it a number of significant design challenges. In this post, I introduced CrowdER and SampleClean, two AMPLab research projects aimed at addressing this problem. Of course, there is a wide range of other open challenges to be researched in this area; we have collected a list of recently published papers on related topics by groups around the world, and interested readers can find them at this link.

    AMP Camp 5
    Posted on December 4, 2014 by Ameet Talwalkar

    AMP Camp 5 was a huge success. Over 200 people participated in this sold-out event, we had over 1,800 views of our live stream from over 40 countries, and we received overwhelmingly positive feedback. In addition to learning about the Berkeley Data Analytics Stack (BDAS), participants were particularly interested in interacting with many of the lead developers of the BDAS software projects, who gave talks about their work and also served as teaching assistants during the hands-on exercises. This 2-day event provided participants with hands-on experience using BDAS, the set of open source projects including Spark, SparkSQL, GraphX, and MLlib/MLbase. For the fifth installment of AMP Camp, we expanded the curriculum to include the newest open source BDAS projects, including Tachyon, SparkR, ML Pipelines, and ADAM, as well as a variety of research and use-case talks. Details about AMP Camp 5, including slides and videos from the talks as well as all of the training material for the hands-on exercises, are available on the AMP Camp website.

    Aggressive Data Skipping for Querying Big Data
    Posted on October 23, 2014 by Liwen Sun

    As data volumes continue to expand, analytics approaches that require exhaustively scanning data sets become untenable. For this reason, we have been developing data organization techniques that make it possible to avoid looking at large volumes of irrelevant data. Our work in this area, which we call Aggressive Data Skipping, recently got picked up by O'Reilly Radar ("Data Today: Behind on Predictive Analytics, Aggressive Data Skipping & More"). In this post, I give a brief overview of the approach and provide references to more detailed publications.

    Data skipping is an increasingly popular technique used in modern analytics databases, including IBM DB2 BLU, Vertica, Spark, and many others. The idea is very simple: big data files are partitioned into fairly small blocks of, say, 10,000 rows, and for each such block we store some metadata, e.g., the min and max of each column. Before scanning a block, a query can first check the metadata and decide whether the block could possibly contain records relevant to the query; if the metadata indicates that no such records are contained in the block, the block does not need to be read, i.e., it can be skipped altogether. In our work, we focus on maximizing the amount of data that can be skipped, hence the name Aggressive Data Skipping. The key to our approach is workload analysis: we observe the queries presented to the system over time and make partitioning decisions based on what is learned from those observations. Our workload-driven, fine-grained partitioning framework re-orders the rows at data loading time. To maximize the chance of data skipping, our research answers the following questions: what partitioning method is appropriate for generating fine-grained blocks, and what kind of concise metadata can we store to support arbitrary filters, e.g., string matching or UDF filters? As shown in the figure below, our approach uses the following (W-A-R-P) steps:

    [Figure: the partitioning framework]

    Workload analysis: we extract the frequently recurring filter patterns, which we call the features, from the workload. The workload can be a log of past ad hoc queries or a collection of query templates from which daily reporting queries are generated.
    Augmentation: for each row, we compute a bit vector based on the features and augment the row with this vector.
    Reduce: we group together the rows with the same bit vectors, since the partitioning decision will be based solely on the bit vectors rather than on the actual data rows.
    Partition: we run a clustering algorithm on the bit vectors and generate a partitioning scheme; the rows are then routed to their destination partitions guided by this scheme.

    After we have partitioned the data, we store a feature bit vector for each partition as metadata. The figure below illustrates how data skipping works during query execution: when a query comes in, our system first checks which features are applicable for data skipping; with this information, the query processor then goes through the partition-level metadata (i.e., the bit vectors) and decides which partitions can be skipped. This process can work in conjunction with existing data skipping based on min/max metadata. A small sketch of the augmentation and skipping steps appears below.

    [Figure: data skipping during query execution]
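    Here is an illustrative Python sketch of those augmentation and skipping steps, with hypothetical features and helper names. The real framework mines the features from the query workload, computes the partition metadata once at load time (this sketch recomputes it for brevity), and chooses the partitioning with a clustering step that this sketch omits.

```python
# Illustrative sketch of feature-based data skipping; features are hypothetical.

FEATURES = [
    lambda row: row["country"] == "US",   # feature 0: a frequent filter
    lambda row: row["price"] > 100,       # feature 1: another mined pattern
]

def bit_vector(row):
    """Augment step: one bit per workload feature."""
    return [int(f(row)) for f in FEATURES]

def partition_metadata(partition):
    """Bitwise OR of all row vectors: bit i is 1 iff some row matches f_i."""
    meta = [0] * len(FEATURES)
    for row in partition:
        for i, b in enumerate(bit_vector(row)):
            meta[i] |= b
    return meta

def scan_with_skipping(partitions, feature_id):
    """Query step: skip partitions whose metadata rules out any match."""
    for part in partitions:
        if partition_metadata(part)[feature_id] == 0:
            continue                       # skipped without reading its rows
        for row in part:
            if FEATURES[feature_id](row):
                yield row

# Example: rows partitioned so that US rows cluster together, letting the
# second partition be skipped entirely for a country = 'US' filter.
parts = [
    [{"country": "US", "price": 50}, {"country": "US", "price": 120}],
    [{"country": "DE", "price": 80}, {"country": "FR", "price": 30}],
]
print(list(scan_with_skipping(parts, feature_id=0)))
```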
    We prototyped this framework on Shark, and our experiments with TPC-H and a real-world dataset show speed-ups of 2x to 7x. An example result from the TPC-H benchmark, measuring average query response time over 80 TPC-H queries, is shown below.

    [Figure: query response time on TPC-H]

    For more technical details and results, please read our SIGMOD '14 paper, or, if you hate formalism and equations, see the demo we gave at VLDB '14. Feel free to send an email to liwen@cs.berkeley.edu with any questions or comments on this project.

    Big Data, Hype, the Media and Other Provocative Words to Put in a Title
    Posted on October 22, 2014 by Michael Jordan

    I've found myself engaged with the Media recently, first in the context of an Ask Me Anything (AMA) with reddit.com (http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan), a fun and engaging way to spend a morning, and then for an interview that has been published in the IEEE Spectrum. That latter process was disillusioning. Well, perhaps a better way to say it is that I didn't harbor that many illusions about science and technology journalism going in, and the process left me with even fewer. The interview is here: http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts. Read the title and the first paragraph and attempt to infer what's in the body of the interview. Now go read the interview and see what you think about the choice of title.

    Here's what I think. The title contains the phrase "The Delusions of Big Data and Other Huge Engineering Efforts". It took me a moment to realize that this was the title that had been placed, without my knowledge, on the interview I did a couple of weeks ago. Anyone who knows me or who's attended any of my recent talks knows that I don't feel that Big Data is a delusion at all; rather, it's a transformative topic, one that is changing academia (e.g., for the first time in my 25-year career, a topic has emerged that almost everyone in academia feels is on the critical path for their sub-discipline) and is changing society (most notably, the micro-economies made possible by learning about individual preferences and then connecting suppliers and consumers directly are transformative). But most of all, from my point of view, it's a major engineering and mathematical challenge, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases, and computer systems. I.e., the whole point of my shtick for the past decade is that Big Data is a Huge Engineering Effort, and that that's no Delusion. Imagine my dismay at a title that said exactly the opposite.

    The next phrase in the title is "Big Data Boondoggles". Not my phrase, nor my thought; I don't talk that way. Moreover, I really don't see anything wrong with anyone gathering lots of data and trying things out, including trying out business models; quite to the contrary, it's the only way we'll learn. Indeed, my bridge analogy from later in the article didn't come out quite right. I was trying to say that historically it was crucial for humans to start to build bridges, and trains, etc., before they had serious engineering principles in place; the empirical engineering effort had immediate positive effects on humans, and it eventually led to the engineering principles. My point was just that it's high time we realize that, with respect to Big Data, we're now at the "what are the principles?" point in time. We need to recognize that poorly-thought-out approaches to large-scale data analysis can be just as costly as bridges falling down; e.g., think of individual medical decision making, where false positives can lead, and already are leading, to unnecessary surgeries and deaths.

    Next, in the first paragraph, I'm implied to say that I think that neural-based chips are likely to prove a fool's errand. Not my phrase, nor my thought. I think that it's perfectly reasonable to explore such chip-building; it's even exciting. As I mentioned in the interview, I do think that a problem with that line of research is that they're putting architecture before algorithms and understanding, and that's not the way I'd personally do things, but others can beg to differ, and by all means I think that they should follow their instincts. The interview then proceeds along, with the interviewer continually trying to get me to express black-and-white opinions about issues where the only reasonable response is gray, and where my overall message (that Big Data is Real, but that It's a Huge Engineering Challenge Requiring Lots of New Ideas and a Few Decades of Hard Work) keeps getting lost, but where I valiantly (I hope) resist. When we got to the Singularity and quantum computing, though, areas where

    Original URL path: https://amplab.cs.berkeley.edu/blog/ (2015-05-19)

  • Sponsors | AMPLab – UC Berkeley
    Affiliates. This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Adatao, Adobe, Apple Inc., Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, HP, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata and VMware.

    Original URL path: https://amplab.cs.berkeley.edu/amp-sponsors/ (2015-05-19)

  • AMPLab Summer Retreat, May 18-20, 2015, Chaminade, Santa Cruz, CA (By Invitation Only) | AMPLab – UC Berkeley
    Santa Cruz, to discuss the latest results from AMPLab projects and to outline the roadmap for the lab and the BDAS stack going forward. Attendees include all of the lab's students, researchers, and faculty, as well as representatives from all of our Industrial Sponsors. It makes for one of the most informative, interactive, and, yes, exclusive Big Data events of the year. Details will be sent soon, but do mark

    Original URL path: https://amplab.cs.berkeley.edu/event/amplab-summer-retreat-may-18-20-2015-chaminade-santa-cruz-ca-by-invitation-only/ (2015-05-19)

  • [Dissertation Talk] Mosharaf Chowdhury, AMPLab, Mending The Application-Network Gap in Big Data Analytics, Friday 5/15, 9am, 465 Soda | AMPLab – UC Berkeley
    and Computer Sciences (EECS). With the rapid rise of cloud computing, scale-out applications running on large clusters are becoming the norm. While the diversity of applications and the capacity of datacenters are continuously growing, application-level and network-level goals are moving further apart. For example, the duration of a shuffle, the communication stage of a MapReduce application, is determined by the completion time of its last flow; this means that one can improve the shuffle completion time by slowing down the smaller flows and allocating the extra bandwidth to speed up the larger flows. However, today's application-agnostic networks treat each flow independently, resulting in suboptimal application-level performance. In this talk, I will present the coflow abstraction, which bridges this gap by exposing the performance goals of data-parallel applications to the network; for example, a coflow can capture the semantics of a shuffle. By leveraging application-level semantics, coflows allow us to improve the communication performance of individual applications, across multiple applications, and in the presence of dynamic events like task failures and speculative executions. I will also describe the design decisions behind Varys, a system that enables applications to take advantage of coflow scheduling without
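    As a hedged illustration of the bandwidth-reallocation intuition above (toy sizes and rates of my own choosing, not the actual Varys scheduler): suppose coflow A has a 1-unit flow a1 on link 1 and a 4-unit flow a2 on link 2, while coflow B has a 1-unit flow b1 on link 1, and every link carries 1 unit of data per unit time. Since a2 pins A's completion at t = 4, slowing a1 down costs A nothing and frees link 1 for B.

```python
# Toy numbers (hypothetical), not the real scheduler: links carry 1 unit of
# data per unit time. Coflow A = {a1: 1 unit on link 1, a2: 4 units on
# link 2}; coflow B = {b1: 1 unit on link 1}.

def finish_time(size, rate):
    """Completion time of a flow sent at a constant rate."""
    return size / rate

a2_done = finish_time(4, 1.0)   # a2 is alone on link 2: done at t = 4

# Per-flow fairness splits link 1 evenly between a1 and b1 (rate 1/2 each).
fair_A = max(finish_time(1, 0.5), a2_done)    # coflow A done at t = 4
fair_B = finish_time(1, 0.5)                  # coflow B done at t = 2

# Coflow-aware allocation: a1 cannot help A finish before t = 4, so it is
# slowed to rate 1/4 and the spare 3/4 of link 1 goes to b1.
aware_A = max(finish_time(1, 0.25), a2_done)  # A still done at t = 4
aware_B = finish_time(1, 0.75)                # B done at t = 4/3

print("per-flow fair:  A =", fair_A, " B =", fair_B)
print("coflow-aware:   A =", aware_A, " B =", aware_B)
```

    Per-flow fairness finishes B at t = 2; the coflow-aware allocation finishes B at t = 4/3 while A still completes at t = 4.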

    Original URL path: https://amplab.cs.berkeley.edu/event/disseration-talk-mosharaf-chowdhury-amplab-mending-the-application-network-gap-in-big-data-analytics-f-515-9am-465-soda/ (2015-05-19)

  • [Dissertation Talk] Gene Pang, AMPLab, Scalable Transactions for Scalable Distributed Databases, W 5/13, noon, 405 Soda | AMPLab – UC Berkeley
    database systems capable of handling the great demands are critical for applications. With the emergence of cloud computing, a major movement in the industry, modern applications depend on distributed data stores for their scalable data management solutions. Many large-scale applications utilize NoSQL systems, such as distributed key-value stores, for their scalability and availability properties over traditional relational database systems. By simplifying the design and interface, NoSQL systems can provide high scalability and performance for large data sets and high-volume workloads. However, to provide such benefits, NoSQL systems sacrifice traditional consistency models and the support for transactions typically available in database systems. Without transaction semantics, it is harder for developers to reason about the correctness of their interactions with the data. Therefore, it is important to support transactions for distributed database systems without sacrificing scalability. In this talk, I present new techniques for scalable transactions for scalable database systems. Distributed data stores need scalable transactions to take advantage of cloud computing and to meet the demands of modern applications. Traditional techniques for transactions may not be appropriate in a large distributed environment, so in this work I describe new techniques for distributed transactions without having to sacrifice traditional semantics

    Original URL path: https://amplab.cs.berkeley.edu/event/dissertation-talk-gene-pang-amplab-scalable-transactions-for-scalable-distributed-databases-w-513-noon-405-soda/ (2015-05-19)

  • Tachyon: A Reliable Memory Centric Distributed Storage System | AMPLab – UC Berkeley
    The result of over two years of research, Tachyon achieves memory speed and fault tolerance by using memory aggressively and leveraging lineage information. Tachyon caches working-set files in memory and enables different jobs, queries, and frameworks to access cached files at memory speed; thus, Tachyon avoids going to disk to load datasets that are frequently read. Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code changes (a brief PySpark sketch follows the project list below). Tachyon is the default off-heap option in Spark, which means that RDDs can automatically be stored inside Tachyon to make Spark more resilient and to avoid GC overheads. The project is open source and is already deployed at multiple companies. In addition, Tachyon has more than 50 contributors from over 20 institutions, including Yahoo, Redhat, Nokia, Intel, Databricks, Cloudera, Alibaba, etc. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution. Collaborating with our industry partners, the lab is continuously enhancing the system and developing exciting things around it. For more information, please visit the Tachyon website; the source code can be obtained from the project's GitHub page. We also host a regular meetup in the Bay Area.

    Projects: Akaros (an operating system for many-core architectures and large-scale SMP systems); BLB (bootstrapping Big Data); Cancer Tumor Genomics (fighting the Big C with the Big D); Carat (collaborative detection of energy bugs); Concurrency Control for Machine Learning; CrowdDB (answering queries with crowdsourcing); DFC (divide-and-conquer matrix factorization); DNA Processing Pipeline; DNA Sequence Alignment with SNAP; GraphX (large-scale graph analytics); MDCC (multi-data-center consistency); Mesos (dynamic resource sharing for clusters); MLbase (distributed machine learning made easy); PIQL (scale-independent query processing); Real-Life Datacenter Workloads; SampleClean (fast and accurate query processing
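    As a hedged illustration of the Hadoop compatibility and Spark off-heap integration described above: the master hostname and paths below are placeholders, and the exact configuration keys for Tachyon-backed storage varied across Spark 1.x releases.

```python
# Minimal PySpark sketch (Spark 1.x era); hostnames and paths are placeholders.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="TachyonSketch")

# Tachyon implements the Hadoop FileSystem API, so an existing job only needs
# a tachyon:// URI where it previously used hdfs://.
events = sc.textFile("tachyon://tachyon-master:19998/datasets/events.log")

# In Spark 1.x, OFF_HEAP storage could be backed by Tachyon (configured via
# spark.tachyonStore.url), keeping cached RDD blocks outside the JVM heap so
# they avoid GC pressure and can survive executor crashes.
events.persist(StorageLevel.OFF_HEAP)
print(events.count())
```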

    Original URL path: https://amplab.cs.berkeley.edu/projects/tachyon-reliable-file-sharing-at-memory-speed-across-cluster-frameworks/ (2015-05-19)