What Do Data Scientists Do? - Joma Tech

Data science is not about making complicated models. It's not about making awesome visualizations. It's not about writing code. Data science is about using data to create as much impact as possible for your company. Now, impact can take multiple forms: insights, data products, or product recommendations for a company.

Now, to do those things you need tools like complicated models, data visualizations, or code. But essentially, as a data scientist, your job is to solve real company problems using data, and what kind of tools you use, we don't care. Now, there's a lot of misconception about data science, especially on YouTube, and I think the reason is that there's a huge misalignment between what's popular to talk about and what's needed in the industry.

So because of that, I want to make things clear. I am a data scientist working for a GAFA company, and those companies emphasize using data to improve their products. So this is my take on what data science is.

Before data science, the term data mining was popularized by a 1996 article called "From Data Mining to Knowledge Discovery in Databases", in which it referred to the overall process of discovering useful information from data. In 2001, William S. Cleveland wanted to bring data mining to another level. He did that by combining computer science with data mining. He made statistics a lot more technical, which he believed would expand the possibilities of data mining and produce a powerful force for innovation. Now you could take advantage of computing power for statistics, and he called this combo data science.

Around this time, Web 2.0 also emerged, where websites are no longer just a digital pamphlet but a medium for a shared experience among millions and millions of users.

These are websites like MySpace in 2003, Facebook in 2004, and YouTube in 2005. We can now interact with these websites, meaning we can contribute, post, comment, like, upload, and share, leaving our footprint in the digital landscape we call the Internet and helping create and shape the ecosystem we now know and love today. And guess what? That's a lot of data. So much data that it became too much to handle using traditional technologies. So we call this big data.

That opened a world of possibilities in finding insights using data. But it also meant that the simplest questions required sophisticated data infrastructure just to support the handling of the data. We needed parallel computing technology like MapReduce, Hadoop, and Spark. So the rise of big data in 2010 sparked the rise of data science to support the needs of businesses to draw insights from their massive unstructured data sets. So then the Journal of Data Science described data science as almost everything that has something to do with data: collecting, analyzing, modeling.
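To give a feel for the MapReduce idea mentioned above, here is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample documents are made up for illustration. Real frameworks run the same map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key (the word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group of values into a single count.
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input; in practice each document could live on a different machine.
documents = ["big data needs big tools", "data science uses data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each document is mapped independently and each word is reduced independently, both phases can be split across workers, which is what makes the pattern scale to data that does not fit on one machine.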

Yet the most important part is its applications. All sorts of applications. Yes, all sorts of applications, like machine learning. So in 2010, the new abundance of data made it possible to train machines with a data-driven approach rather than a knowledge-driven approach. All the theoretical papers about recurrent neural networks and support vector machines became feasible: something that can change the way we live and how we experience things in the world. Deep learning is no longer an academic concept in some thesis paper. It became a tangible, useful class of machine learning that would affect our everyday lives. So machine learning and AI dominated the media, overshadowing every other aspect of data science, like exploratory analysis, experimentation, and skills we traditionally called business intelligence.

So now the general public thinks of data scientists as researchers focused on machine learning and AI, but the industry is hiring data scientists as analysts. So there's a misalignment there. The reason for the misalignment is that, yes, most of these data scientists can probably work on more technical problems, but big companies like Google, Facebook, and Netflix have so much low-hanging fruit to improve their products that they don't require any advanced machine learning or statistical knowledge to find these impacts in their analysis.

Being a good data scientist isn't about how advanced your models are. It's about how much impact you can have with your work. You're not a data cruncher. You're a problem solver. You're a strategist. Companies will give you the most ambiguous and hard problems, and we expect you to guide the company in the right direction.

Okay, now I want to conclude with real-life examples of data science jobs in Silicon Valley. But first I have to print some charts. So let's go do that. (conversation not directly related to the topic)

So this is a very useful chart that tells you the needs of data science. Now, it's pretty obvious, but sometimes we kind of forget about it. At the bottom of the pyramid we have collect: you have to collect some sort of data to be able to use that data. So collecting, storing, transforming, all of these data engineering efforts are pretty important, and they're actually captured pretty well in the media because of big data. We talked about how difficult it is to manage all this data. We talked about parallel computing, which means things like Hadoop and Spark, stuff like that. We know about this.

Now, the thing that's less known is the stuff in between, which is right here, everything that's here. And surprisingly, this is one of the most important things for companies, because you're trying to tell the company what to do with their product. So what do I mean by that? So analytics tells you, using the data, what kind of insights can tell me what is happening to my users. And then metrics: this is important because, what's going on with my product? You know, these metrics will tell you if you're successful or not.

And then also, you know, A/B testing, of course. Experimentation that allows you to know which product versions are the best. So these things are actually really important, but they're not so covered in the media. What's covered in the media is this part: AI, deep learning. We've heard about it on and on, you know. But when you think about it, for a company, for the industry, it's not the highest priority, or at least it's not the thing that yields the most result for the lowest amount of effort. That's why AI and deep learning are at the top of the hierarchy of needs, and these things, A/B testing and analytics, are way more important for the industry. So that's why we're hiring a lot of data scientists that do that.

So what do data scientists do? Well, that depends on the company, because of the size. So for a start-up, you kind of lack resources, so you can only kind of have one DS. So that one data scientist has to do everything. So you might be doing all of this as a data scientist. Maybe you won't be doing AI or deep learning, because that's not a priority right now, but you might be doing all of these. You have to set up the whole data infrastructure.
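To make the A/B testing mentioned above a bit more concrete, here is a minimal sketch of one common way to compare two product versions: a two-proportion z-test on conversion rates. The function name `two_proportion_z_test` and the sample numbers are assumptions for illustration, not something from the video, and real experimentation platforms handle much more (sample sizing, multiple metrics, and so on).

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis that A and B are equal.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, computed via erf.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Hypothetical experiment: version A converts 200 of 5000 users, version B 260 of 5000.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the versions is unlikely to be noise alone. This kind of check needs no machine learning at all, which is the point being made about where the low-effort impact is.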

You might even have to write some software code to add logging, and then you have to do the analytics yourself, build the metrics yourself, and do the A/B testing yourself. That's why for startups, if they need a data scientist, this whole thing is data science, so that means you have to do everything.

But let's look at medium-sized companies. Now, finally, they have a lot more resources. They can separate the data engineers and the data scientists. So usually, in collection, this is probably software engineering. And then here, you're going to have data engineers doing this. And then, if your medium-sized company does a lot of recommendation models or stuff that requires AI, then DS will do all of these. Right. So as a data scientist, you have to be a lot more technical. That's why they only hire people with PhDs or master's degrees, because they want you to be able to do the more complicated things.

So let's talk about the large company now. Because you're getting a lot bigger, you probably have a lot more money, and then you can spend more of it on employees. So you can have a lot of different employees working on different things.

That way, employees don't need to think about the stuff they don't want to do, and they can focus on the things they're best at. For example, me, at my untitled large company: I would be in analytics, so I could just focus my work on analytics and metrics and stuff like that. So I don't need to worry about data engineering or AI and deep learning stuff.

So here's how it looks for a large company. Instrumentation, logging, sensors: this is all handled by software engineers, right? And then here, cleaning and building data pipelines: this is for data engineers. Now here, between these two things, we have Data Science Analytics. That's what it's called. But then once we go to AI and deep learning, this is where we have research scientists, or we call it Data Science Core, and they are backed by ML engineers, which are machine learning engineers.

Yeah. Anyway, so in summary, as you can see, data science can be all of this, and it depends on what company you are in, and the definition will vary. So please let me know what you would like to learn more about: AI and deep learning, or A/B testing and experimentation. Depending on what you want to learn about, leave a comment down below so I can talk about it, or I can find someone who knows about it and share the insights with you. So yeah, if you liked this video, don't forget to like and subscribe.

So, yeah. Hope you have a wonderful day. 

Hope this was helpful.
