What is Big Data and Hadoop?



Hi, I'm Bill Appleby, and today, in seven minutes flat, I'm going to explain how Hadoop works, what you can do with it, and what Big Data is. I've done a lot of Big Data projects in Australia, in Canada, and in the United States, and I'm also a Learning Tree instructor.

OK, so why Big Data? Firstly, we all know that governments and businesses are gathering lots of data these days: movies, images, transactions. But why? The answer is that data is incredibly valuable. Analyzing all that data lets us do things like detect fraud going years back. These days disk is cheap, so we can afford to keep all that data. But there's a catch: all that data won't fit anymore on a single processor or a single disk, so we have to distribute it across thousands of nodes. There's a good side to that, though. If the data is distributed and we run in parallel, we can compute thousands of times faster and do things we couldn't possibly do before. That's the trick behind Hadoop.

OK, how does it work? Suppose what I wanted to do is look for an image spread across many hundreds of files. First off, Hadoop has to know where that data is. It goes and queries something called a NameNode to find out all the places where the data file is located. Once it's figured that out, it sends your job out to each one of those nodes. Each one of those processors independently reads its input file, looks for the image, and writes the results out to a local output file. That's all done in parallel. When they all report finished, you're done.

OK, we've seen one simple example of what you might want to do with Hadoop, image recognition, but there's a lot more to it than that. For example, I can do statistical data analysis. I might want to calculate means, averages, correlations, and all sorts of other data. For example, I might want to look at unemployment versus population versus income versus states. If I have all the data in Hadoop, I can do that. I can also do machine learning and all sorts of other analysis. Once you've got the data in Hadoop, there is almost no
limit to what you can do.

OK, we've seen that in Hadoop data is always distributed, both the input and the output. There's more to it than that: the data is also replicated. Copies are kept of all the data blocks, so that if one node falls over, it doesn't affect the result. That's how we get reliability. But sometimes we need to communicate between nodes; it's not enough that everybody processes their local data alone. An example is counting or sorting. In that case communication is required, and the Hadoop trick for that is called MapReduce.

Let's look at an example of how MapReduce works. What we're going to do is take a little application called count dates that counts the number of times a date occurred, spread across many different files. The first phase is called the map phase. Each processor that has an input file reads the input file in, counts the number of times those dates occurred, and then writes the result as a set of key-value pairs. After that's done, we have what's called the shuffle phase. Hadoop automatically sends all of the 2000 data to one processor, the 2001 data to another processor, and the 2002 data to another processor. After that shuffle phase is complete, we can do what's called a reduce. In the reduce phase, all of the 2000 data is summed up and written to the output file. When everybody has completed their summations, they report done, and the job is done.

OK, we've seen a couple of great examples of how Hadoop works. The next question is: how does Hadoop compare to conventional relational databases? Because they've dominated the market for years. We've seen one big difference, which is that in Hadoop data is distributed across many nodes, and the processing of that data is distributed. By contrast, in a conventional relational database, conceptually all the data sits on one server in one database. But there are more differences than that. The biggest difference is that in Hadoop data is write once, read many. In other words, once you've written data, you're not allowed to modify it. You
can delete it, but you can't modify it. By contrast, in relational databases data can be written many times, like the balance of your account. But in archival data, which Hadoop is optimized for, once you've written the data you don't want to modify it. If it's archival data about telephone calls or transactions, you don't want to change it once you've written it. There's another difference too. In relational databases we always use SQL. By contrast, Hadoop doesn't support SQL at all. It supports lightweight versions of SQL called NoSQL, but not conventional SQL. Also, Hadoop is not just a single product or platform. It's a very rich ecosystem of tools, technologies, and platforms, almost all of which are open source and all work together.

So what's in the Hadoop ecosystem? At the lowest level, Hadoop just runs on commodity hardware and software. You don't need to buy any special hardware, and it runs on many operating systems. On top of that is the Hadoop layer, which is MapReduce and the Hadoop Distributed File System. On top of that is a set of tools and utilities, such as RHadoop, which is statistical data processing using the R programming language. There is a machine learning tool. There are also tools for doing NoSQL, like Hive and Pig, and the neat thing about those tools is that they support semi-structured or unstructured data. You don't have to have your data stored in a conventional schema; instead, you can read the data and figure out the schema as you go along. Finally, we have tools for getting data into and out of the Hadoop file system, like Sqoop. That ecosystem is constantly evolving, so, for example, there's now a new tool for managing the Pig tool called Lipstick on a Pig, and there are many more. That environment keeps being added to all the time.

So now we've seen how it works and what it can do. I'm sure you've got more questions than that, such as how do I install Hadoop and on what platforms, the differences between different Hadoop versions, or how to do extract,
transform, and load in Hadoop. Answers to those questions are on our website at the following URL. I really hope you've enjoyed this video. Take care. Cheers.
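
The count dates walkthrough from the MapReduce part of the talk can be sketched in a few lines of Python. This is only an illustration of the map, shuffle, and reduce phases running in a single process, not Hadoop's actual API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `count_dates`) and the sample input are assumptions made for the example, and each "file" is simply a list of year strings.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Map: count the dates in ONE input file and emit (date, count) pairs."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return list(counts.items())


def shuffle_phase(mapped_outputs):
    """Shuffle: group the partial counts so each date ends up in one place.

    In Hadoop this routing happens between nodes; here it is just a dict.
    """
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for date, count in pairs:
            grouped[date].append(count)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: sum the partial counts for each date."""
    return {date: sum(counts) for date, counts in grouped.items()}


def count_dates(files):
    """Run the three phases in order over a list of in-memory 'files'."""
    mapped = [map_phase(lines) for lines in files]  # parallel on a real cluster
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical input: three small files, one date (here, a year) per line.
files = [
    ["2000", "2001", "2000"],
    ["2001", "2002"],
    ["2000", "2002", "2002"],
]
print(count_dates(files))  # {'2000': 3, '2001': 2, '2002': 3}
```

On a real cluster each call to `map_phase` would run on the node that holds that file, and the shuffle would move the key-value pairs across the network so that all counts for one date reach the same reducer.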

46 Replies to “What is Big Data and Hadoop?”

  2. God bless Indians for being so smart but I don't understand a thing they're saying so I am glad they uploaded this video

  9. 0:29 why big data
    1:21 how hadoop works
    1:59 what can you do with hadoop
    2:37 more on how hadoop works
    3:19 how MapReduce works
    4:27 hadoop vs conventional databases
    6:16 hadoop ecosystem
    7:32 what's next
