Saturday, August 17, 2019

What You Need to Learn to Become a Data Scientist




1 Introduction
This section covers all of the data science skills you’ll need to learn. You’ll also learn about the tools you need to do your job.

Most data scientists use a combination of skills every day, some of which they taught themselves on the job. They also come from various backgrounds: there isn’t any one specific academic credential that data scientists are required to have.



All the skills we discuss are things you can teach yourself or learn with a mentor. We’ve laid out some resources to get you started down that path.

2 Data Science Skills

2.1 An Analytical Mind

Takeaway

You need to approach data science problems analytically to solve them.

You’ll need an analytical mindset to do well in data science.



A lot of data science involves solving problems. You’ll have to be adept at framing those problems and methodically applying logic to solve them.


2.2 Mathematics

Takeaway
Mathematics is an important part of data science. Make sure you know the basics of university math, from calculus to linear algebra. The more math you know, the better.







When data gets large, it often gets unwieldy. You’ll have to use mathematics to process and structure the data you’re dealing with.

You won’t be able to get away from calculus and linear algebra, so catch up on those topics if you missed them in undergrad. You’ll need to understand how to manipulate matrices of data and get a general idea of the math behind algorithms.


2.3 Statistics

Takeaway

You must know statistics to infer insights from smaller data sets onto larger populations. This is the fundamental law of data science.




You need to know statistics to play with data. Statistics allows you to slice and dice through data, extracting the insights you need to make reasonable conclusions.
Understanding inferential statistics allows you to make general conclusions about everybody in a population from a smaller sample.


To understand data science, you must know the basics of hypothesis testing and experiment design in order to understand the meaning and context of your data.
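
To make this concrete, here is a minimal sketch of a two-sample hypothesis test in Python using SciPy; the spending figures are invented purely for illustration.

# Hypothetical question: do visitors who saw page B spend more than
# visitors who saw page A? A two-sample t-test compares the means.
from scipy import stats

page_a = [23.1, 19.8, 25.0, 21.7, 22.4, 20.9]  # made-up spend per visitor
page_b = [26.3, 24.8, 27.1, 25.5, 23.9, 26.0]

t_stat, p_value = stats.ttest_ind(page_a, page_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (commonly < 0.05) suggests the difference between the
# groups is unlikely to be due to chance alone.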

2.4 Algorithms

Takeaway
Algorithms are sets of rules or steps that a computer can follow. Understanding how to use machines to do your work is essential to processing and analyzing data sets too large for the human mind to process.



In order for you to do any heavy lifting in data science, you’ll have to understand the theory behind algorithm selection and optimization. You’ll have to decide whether or not your problem demands a regression analysis, or an algorithm that helps classify different data points into defined categories.

You’ll want to know many different algorithms, and you’ll want to learn the fundamentals of machine learning. Machine learning is what allows Amazon to recommend products based on your purchase history without any direct human intervention. It is a set of algorithms that use machine power to unearth insights for you.

In order to deal with massive data sets you’ll need to use machines to extend your thinking.
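
To make the regression-versus-classification distinction concrete, here is a minimal scikit-learn sketch on toy data; the numbers are invented for illustration.

# Toy data: square footage of a house.
# Regression predicts a continuous value (price); classification
# predicts a category (sells within a month: yes or no).
from sklearn.linear_model import LinearRegression, LogisticRegression

sqft = [[800], [1200], [1500], [2000], [2600]]
price = [95000, 140000, 175000, 230000, 310000]  # continuous target
sold_fast = [0, 0, 1, 1, 1]                      # categorical target

reg = LinearRegression().fit(sqft, price)
clf = LogisticRegression().fit(sqft, sold_fast)

print(reg.predict([[1800]]))  # a dollar estimate
print(clf.predict([[1800]]))  # a class label, 0 or 1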


2.5 Data Visualization

Takeaway

Finishing your data analysis is only half the battle. To drive impact, you will have to convince others to believe and adopt your insights.



Human beings are visual creatures. According to 3M and Zabisco, almost 90% of the information transmitted to your brain is visual in nature, and visuals are processed 60,000 times faster than text.

Human beings have been wired to respond to visual cues. You’ll need to find a way to convey your insights accordingly.
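
As a simple illustration, here is a minimal matplotlib sketch that turns a handful of made-up numbers into a bar chart.

import matplotlib.pyplot as plt

# Made-up monthly sales figures for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 90, 160]

plt.bar(months, sales)
plt.title("Monthly Sales")
plt.ylabel("Units sold")
plt.show()  # the dip in March jumps out faster here than in a table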

2.6 Business Knowledge

Takeaway

Data means little without its context. You have to understand the business you’re analyzing.



Most companies depend on their data scientists not just to mine data sets, but also to communicate their results to various stakeholders and present recommendations that can be acted upon.

The best data scientists not only have the ability to work with large, complex data sets, but also understand intricacies of the business or organization they work for.

Having general business knowledge allows them to ask the right questions, and come up with insightful solutions and recommendations that are actually feasible given any constraints that the business might impose.

2.7 Domain Expertise

Takeaway

As a data scientist, you should know the business you work for and the industry it lives in.

Beyond having deep knowledge of the company you work for, you’ll also have to understand its field for your insights to make sense. Data from a biology study can have a drastically different context than data gleaned from a well-designed psychology study. You should know enough to cut through industry jargon.


3 Data Science Tools

With your skill set developed, you’ll now need to learn how to use modern data science tools. Each tool has its strengths and weaknesses, and each plays a different role in the data science process. You can use just one of them, or you can use all of them. What follows is a broad overview of the most popular tools in data science as well as the resources you’ll need to learn them properly if you want to dive deeper.


3.1 File Formats

Data can be stored in different file formats. Here are some of the most common:

CSV
Comma separated values. You may have opened this sort of file with Excel before. CSVs separate out data with a delimiter, a piece of punctuation that serves to separate out different data points.

SQL
SQL, or Structured Query Language, stores data in relational tables. Reading across a single row, column by column, gives you different data points about the same entity (for example, a person will have a value in each of the AGE, GENDER, and HEIGHT columns).

JSON 
JavaScript Object Notation is a lightweight data exchange format that is both human- and machine-readable. Data from a web server is often transmitted in this format.
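
To see how the same record looks in two of these formats, here is a minimal Python sketch that parses a CSV string and a JSON string with the standard library; the record itself is invented.

import csv
import io
import json

# The same hypothetical person, once as CSV and once as JSON.
csv_text = "name,age,height\nAda,36,170\n"
json_text = '{"name": "Ada", "age": 36, "height": 170}'

for row in csv.DictReader(io.StringIO(csv_text)):
    print(row)  # {'name': 'Ada', 'age': '36', 'height': '170'}

print(json.loads(json_text))  # {'name': 'Ada', 'age': 36, 'height': 170}
# Note that CSV hands you strings, while JSON preserves numeric types.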


3.2 Excel

Takeaway

Excel is often the gateway to data science, and something that every data scientist can benefit from learning.



Introduction to Excel
Excel allows you to easily manipulate data with what is essentially a What You See Is What You Get editor that allows you to perform equations on data without working in code at all. It is a handy tool for data analysts who want to get results without programming.

Benefits of Excel
Excel is easy to get started with, and it’s a program that anybody who is in analytics will intuitively grasp. It can be very useful to communicate data to people who may not have any programming skills: they should still be able to play with the data.

Who Uses This
Data analysts tend to use Excel.
Level of Difficulty

Beginner

Sample Project

Importing a small dataset on the statistics of NBA players and making a simple graph of the top scorers in the league

3.3 SQL

Takeaway

SQL is the most popular programming language for retrieving data.



Introduction to SQL
Data science needs data. SQL is a programming language specially designed to extract data from databases.

Benefits of SQL
SQL is the most popular tool used by data scientists. Most data in the world is stored in tables that will require SQL to access. You’ll be able to filter and sort through the data with it.

Who Uses This
Data analysts and some data engineers tend to use SQL.

Level of Difficulty

Beginner

Sample Project

Using a SQL query to select the top ten most popular songs from a SQL database of the Billboard 100.
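
Here is a minimal sketch of that sample project using Python’s built-in sqlite3 module; the billboard table and its columns are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE billboard (song TEXT, artist TEXT, position INTEGER)")
conn.executemany(
    "INSERT INTO billboard VALUES (?, ?, ?)",
    [("Song A", "Artist A", 1), ("Song B", "Artist B", 2)],  # sample rows
)

# The heart of the project: sort by chart position and keep the top ten.
top_ten = conn.execute(
    "SELECT song, artist FROM billboard ORDER BY position LIMIT 10"
).fetchall()
print(top_ten)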


3.4 Python

Takeaway

Python is a powerful, versatile programming language for data science.



Introduction to Python
Once you download Anaconda, an environment manager for Python, and get set up with IPython Notebook, you’ll quickly realize how intuitive Python is. A versatile programming language built for everything from building websites to gathering data from across the web, Python has many code libraries dedicated to making data science work easier.

Benefits of Python
Python is a versatile programming language with a simple syntax that is easy to learn.

The average salary for jobs with Python in their description is around $102,000. Python is the most popular programming language taught in universities: the community of Python programmers is only going to get larger in the years to come. The Python community is passionate about teaching Python, and about building useful tools that will save you time and allow you to do more with your data.

Many data scientists use Python to solve their problems: 40% of respondents to a data science survey conducted by O’Reilly used Python, which was more than the 36% who used Excel.

Who Uses This
Data engineers and data scientists will use Python for medium-size data sets.

Level of Difficulty

Intermediate

Sample Project

Using Python to source tweets from celebrities, then doing an analysis of the most frequent words used by applying programming rules.
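
Sourcing the tweets requires access to the Twitter API, but the analysis half of the project is only a few lines; here is a minimal sketch over invented tweet text.

import re
from collections import Counter

# In the real project these strings would come from the Twitter API.
tweets = [
    "Excited to announce our new album!",
    "New tour dates announced, so excited",
]

words = re.findall(r"[a-z']+", " ".join(tweets).lower())
print(Counter(words).most_common(5))  # the most frequent words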

3.5 R

Takeaway

R is a staple in the data science community because it is designed explicitly for data science needs. It is the most popular programming environment in data science with 43% of data professionals using it.



Introduction to R
R is a programming environment designed for data analysis. R shines when it comes to building statistical models and displaying the results.

Benefits of R
R is slightly more popular than Python in data science, with 43% of data scientists using it in their tool stack compared to the 40% who use Python.

It is an environment where a wide variety of statistical and graphing techniques can be applied.

The community contributes packages that, similar to Python, can extend the core functions of the R codebase so that it can be applied to very specific problems such as measuring financial metrics or analyzing climate data.

Who Uses This
Data engineers and data scientists will use R for medium-size data sets.

Level of Difficulty

Intermediate

Sample Project

Using R to graph stock market movements over the last five years.
3.6 Big Data Tools

Big data is a consequence of Moore’s Law, the observation that computing power doubles roughly every two years. This has led to the rise of massive data sets generated by millions of computers. Imagine how much data Facebook has at any given time!

Any data set that is too large for conventional data tools such as SQL and Excel can be considered big data, according to McKinsey. The simplest definition is that big data is data that can’t fit onto your computer.


3.7 Hadoop

Takeaway

By using Hadoop, you can store your data on multiple servers while controlling it from one.



Introduction to Hadoop

The solution is a technology called MapReduce. MapReduce is an elegant abstraction that treats a series of computers as if they were one central server. This allows you to store data on multiple computers, but process it through one.
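
To make the idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are small scripts reading stdin and writing stdout. Treat it as an illustration of the map/reduce split rather than a production job.

import sys
from itertools import groupby

def mapper(lines):
    # Map step: emit "word<TAB>1" for every word in this node's input split.
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    # Reduce step: the framework delivers mapper output sorted by key,
    # so summing the counts per word is a single pass with groupby.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Simulate the framework locally with:
    #   cat input.txt | python wc.py map | sort | python wc.py reduce
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)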


Benefits of Hadoop

Hadoop is an open-source ecosystem of tools that allow you to MapReduce your data and store enormous datasets on different servers. It allows you to manage much more data than you can on a single computer.

Who Uses This

Data engineers and data scientists will use Hadoop to handle big data sets.

Level of Difficulty
Advanced
Sample Project 
Using Hadoop to store massive datasets that update in real time, such as the number of likes Facebook users generate.


3.8 NoSQL

Takeaway

NoSQL allows you to manage data without unneeded weight.



Introduction to NoSQL

Tables that bring all their data with them can become cumbersome. NoSQL includes a host of data storage solutions that separate out huge data sets into manageable chunks.

Benefits of NoSQL

NoSQL is a trend pioneered by Google to deal with the impossibly large amounts of data it was storing. Often structured in the JSON format popular with web developers, solutions like MongoDB offer databases that can be manipulated much like SQL tables, but which store the data with far less rigid structure.
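
As an illustration, here is a minimal pymongo sketch that stores and queries JSON-like documents; it assumes a MongoDB server running locally on the default port, and the database and collection names are hypothetical.

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("localhost", 27017)  # assumes a local MongoDB server
users = client.my_app.users

# Documents are JSON-like and need not share a rigid schema.
users.insert_one({"name": "Ada", "interests": ["math", "engines"]})
users.insert_one({"name": "Grace", "country": "USA"})

print(users.find_one({"name": "Ada"}))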

Who Uses This

Data engineers and data scientists will use NoSQL for big data sets, often website databases for millions of users.

Level of Difficulty
Advanced

Sample Project
Storing data on users of a social media application that is deployed on the web.


Wednesday, August 7, 2019

Data Scientists in Action



1. Day in the Life of a Data Scientist

This story is based on the day-to-day of an industry expert in the financial sector, who wishes to remain anonymous.

Data scientists in finance try to predict whether or not people will default on their credit based on certain predictive factors, or they help classify which transactions seem fraudulent. All of this requires looking through millions of lines of data, and it involves extrapolating to the future, a skill almost all human beings are notoriously bad at. However, the day-to-day isn’t just spent looking through numbers.

9 am
There’s a lot of legwork that goes into data science just like any other job. Nearly an hour is spent just catching up on email and organizing for the day ahead.

10 am
A surprisingly high amount of time in data science is spent recruiting. Demand for data science skills is at an all-time high, so data science organizations are often evaluating potential recruits. Data scientists will often take time out of their days to do phone screens of potential new team members.

11 am
Data scientists spend a lot of time in meetings. Almost an hour is spent just making sure that every team is properly aligned with one another, and working on the right things.

12 pm
Lunch offers the chance to relax a bit and catch up with colleagues. Then it’s back to the grind. One half of the typical day is spent coding an analysis or looking over somebody else’s code. This might involve building a graph to represent insights unearthed during a look through the data, or it might just be about making sure your own code is clean so everybody on your team can read through it and understand what is going on.

4 pm
Data scientists will often discuss with groups of fellow data scientists ways that they can collaborate and help one another. They’ll often learn together and share the latest tool that can help improve productivity.




2. Infusing Data in Your Workplace: Chase Lehrman


Chase Lehrman works as a data analyst at a fast-growing education company called Higher Learning Technologies that helps dental and nursing students pass their board exams. He describes his day-to-day as being a data storyteller who looks to gain an understanding of how users are using the product Higher Learning Technologies sells. He also helps people across the organization get the data they need to make informed decisions: a recent example involved sizing a market.

Thanks to Chase, Higher Learning Technologies can change its static data into usable insights, something every data scientist should get their organization to embrace. Chase makes sure that data problems are framed the right way and that solutions are properly communicated and actionable.




Data scientists solve many different problems. A data scientist might hunt for raw data. They might be asked to create automated programs that can process data quickly and efficiently. They might be asked to communicate their results and why they matter to the CEO of a company. You will have to learn a versatile skillset, and a variety of tools if you want to become one.



3 Understanding the Data: Sneha Runwal

Sneha Runwal is a statistician in Apple’s AppleCare division. Her major work there involves forecasting and time series analysis, in addition to anomaly detection.

She feels that people are often too quick to delve into algorithms and computer code, but it’s important to step back and understand your data before you get into implementation mode. She says she is trying to get more disciplined about this herself. Her advice? Understand as much of your data as possible, as early as you can.




Wednesday, July 24, 2019

Data Science Process



People often ask, “What does a data scientist do?” or “What does a day in the data science life look like?” These questions are tricky. The answer can vary by role and company.

Here’s a summary of insights.




Step 1: Frame the problem
The first thing you have to do before you solve a problem is to define exactly what it is. You need to be able to translate data questions into something actionable.

You’ll often get ambiguous inputs from the people who have problems. You’ll have to develop the intuition to turn scarce inputs into actionable outputs--and to ask the questions that nobody else is asking.

Say you’re solving a problem for the VP Sales of your company. You should start by understanding their goals and the underlying why behind their data questions.

Before you can start thinking of solutions, you’ll want to work with them to clearly define the problem.

A great way to do this is to ask the right questions.

You should then figure out what the sales process looks like, and who the customers are. You need as much context as possible for your numbers to become insights.

You should ask questions like the following:

  1. Who are the customers?
  2. Why are they buying our product?
  3. How do we predict if a customer is going to buy our product?
  4. What is different between segments that are performing well and those that are performing below expectations?
  5. How much money will we lose if we don’t actively sell the product to these groups?


In response to your questions, the VP Sales might reveal that they want to understand why certain segments of customers have bought less than expected. Their end goal might be to determine whether to continue to invest in these segments, or de-prioritize them. You’ll want to tailor your analysis to that problem, and unearth insights that can support either conclusion.

It’s important that at the end of this stage, you have all of the information and context you need to solve this problem. 

You need as much context as possible for your numbers to become insights.


Step 2: Collect the raw data needed for your problem

Once you’ve defined the problem, you’ll need data to give you the insights needed to turn the problem around with a solution. This part of the process involves thinking through what data you’ll need and finding ways to get that data, whether it’s querying internal databases or purchasing external datasets.



You might find out that your company stores all of its sales data in a CRM, or customer relationship management software platform. You can export the CRM data as a CSV file for further analysis.

Step 3: Process the data for analysis

Now that you have all of the raw data, you’ll need to process it before you can do any analysis. Oftentimes, data can be quite messy, especially if it hasn’t been well-maintained. You’ll see errors that will corrupt your analysis: values set to null though they really are zero, duplicate values, and missing values. It’s up to you to go through and check your data to make sure you’ll get accurate insights.

You’ll want to check for the following common errors:

  1. Missing values
  2. Corrupted values
  3. Timezone differences
  4. Date range errors, such as data registered from before sales started

You’ll need to look through aggregates of your file rows and columns and sample some test values to see if your values make sense. If you detect something that doesn’t make sense, you’ll need to remove that data or replace it with a default value. You’ll need to use your intuition here: if a customer doesn’t have an initial contact date, does it make sense to say that there was NO initial contact date? Or do you have to hunt down the VP Sales and ask if anybody has data on the customer’s missing initial contact dates?
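
Here is a minimal pandas sketch of those checks, assuming a hypothetical CRM export named crm_export.csv with customer_id, initial_contact_date, and amount columns.

import pandas as pd

df = pd.read_csv("crm_export.csv", parse_dates=["initial_contact_date"])

print(df.isnull().sum())                 # missing values per column
print(df[df.duplicated("customer_id")])  # duplicate customers

# Date range errors: drop rows registered before sales started.
sales_start = pd.Timestamp("2018-01-01")  # hypothetical launch date
df = df[df["initial_contact_date"] >= sales_start]

# The judgment call from above: fill missing amounts with a default,
# or go hunt down the real values.
df["amount"] = df["amount"].fillna(0)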
Once you’re done working with those questions and cleaning your data, you’ll be ready for exploratory data analysis (EDA).

Step 4: Explore the data

When your data is clean, you should start playing with it!

The difficulty here isn’t coming up with ideas to test, it’s coming up with ideas that are likely to turn into insights. You’ll have a fixed deadline for your data science project (your VP Sales is probably waiting on your analysis eagerly!), so you’ll have to prioritize your questions. 

You’ll have to look at interesting patterns that explain why sales are reduced for this group. You might notice that they don’t tend to be very active on social media, with few of them having Twitter or Facebook accounts. You might also notice that most of them are older than your general audience. From that you can begin to trace patterns you can analyze more deeply.
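
A minimal pandas sketch of that kind of exploration, assuming the cleaned data has hypothetical segment, customer_id, age, and has_social_media columns:

import pandas as pd

df = pd.read_csv("crm_clean.csv")  # hypothetical cleaned export

# Compare the underperforming segment against the others.
summary = df.groupby("segment").agg(
    customers=("customer_id", "count"),
    avg_age=("age", "mean"),
    social_media_share=("has_social_media", "mean"),
)
print(summary)
# A segment that skews older and is rarely on social media is exactly
# the kind of pattern worth analyzing more deeply in the next step.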

Step 5: Perform in-depth analysis

This step of the process is where you’re going to have to apply your statistical, mathematical and technological knowledge and leverage all of the data science tools at your disposal to crunch the data and find every insight you can.

In this case, you might have to create a predictive model that compares your underperforming group with your average customer. You might find out that age and social media activity are significant factors in predicting who will buy the product.
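
A minimal sketch of such a model, fitting scikit-learn’s logistic regression to the hypothetical age and social media features:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("crm_clean.csv")    # hypothetical cleaned export
X = df[["age", "has_social_media"]]
y = df["purchased"]                  # hypothetical 0/1 target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on held-out customers
print(model.coef_)                  # sign and size hint at each factor's pull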

If you’d asked a lot of the right questions while framing your problem, you might realize that the company has been concentrating heavily on social media marketing efforts, with messaging that is aimed at younger audiences.

You would know that certain demographics prefer being reached by telephone rather than by social media. You begin to see how the way the product has been marketed is significantly affecting sales: maybe this problem group isn’t a lost cause! A change in tactics from social media marketing to more in-person interactions could change everything for the better. This is something you’ll have to flag to your VP Sales.

You can now combine all of those qualitative insights with data from your quantitative analysis to craft a story that moves people to action.


Step 6: Communicate results of the analysis 


It’s important that the VP Sales understand why the insights you’ve uncovered are important. Ultimately, you’ve been called upon to create a solution throughout the data science process. Proper communication will mean the difference between action and inaction on your proposals.

Proper communication will mean the difference between action and inaction on your proposals.

You need to craft a compelling story here that ties your data with their knowledge. You start by explaining the reasons behind the underperformance of the older demographic. You tie that in with the answers your VP Sales gave you and the insights you’ve uncovered from the data. Then you move to concrete solutions that address the problem: we could shift some resources from social media to personal calls. You tie it all together into a narrative that solves the pain of your VP Sales: she now has clarity on how she can reclaim sales and hit her objectives.

She is now ready to act on your proposals.

***

As a data scientist, you’ll have to learn how to work through the entire data science process. Here’s what that looks like from day to day.


The Different Data Science Roles


Before we dive too deep into what skills you need to become a data scientist, you should be aware that there are different roles in data science. Oftentimes, a data science team will rely on different team members for different skill sets. Or the skill set needed may depend on the type of company and the part of the organization you work in. You don’t have to become the world’s best at everything.

While there are some basics every data scientist should know (e.g. basic statistics), data science roles can vary significantly in their demands and expectations. 

Let’s look at some broad categories of roles that all get lumped under the umbrella term “data science”.


Data Scientists

One definition of a data scientist is someone who knows more programming than a statistician, and more statistics than a software engineer. Data scientists fine-tune the statistical and mathematical models that are applied to data. This could involve applying theoretical knowledge of statistics and algorithms to find the best way to solve a data problem.



For instance, a data scientist might use historical data to build a model that predicts the number of credit card defaults in the following month.

A data scientist will be able to run with data science projects from end-to-end. They can store and clean large amounts of data, explore data sets to identify insights, build predictive models and weave a story around the findings.

Within the broad category of data scientists, you might encounter statisticians who focus on statistical approaches to data, and data managers who focus on running data science teams.

Data scientists are the bridge between programming and implementation of data science, the theory of data science, and the business implications of data.


Data Engineers

Data engineers are software engineers who handle large amounts of data, and often lay the groundwork and plumbing for data scientists to do their jobs effectively. They are responsible for managing database systems, scaling the data architecture to multiple servers, and writing complex queries to sift through the data. They might also clean up data sets and implement complex requests that come from data scientists, e.g., taking the predictive model from the data scientist and implementing it in production-ready code.

Data engineers, in addition to knowing a breadth of programming languages (e.g. Ruby or Python), will usually know some Hadoop-based technologies (e.g. MapReduce, Hive, and Pig) and database technologies like MySQL, Cassandra, and MongoDB.

Within the broad category of data engineers, you’ll find data architects who focus on structuring the technology that manages data models and database administrators who focus on managing data storage solutions.


Data Analysts and Business Analysts

Data analysts sift through data and provide reports and visualizations to explain what insights the data is hiding. When somebody helps people from across the company understand specific queries with charts, they are filling the data analyst
(or business analyst) role. In some ways, you can think of them as junior data scientists, or the first step on the way to a data science job.

Business analysts are a group that’s adjacent to data analysts, and are more concerned with the business implications of the data and the actions that should result. Should the company invest more in project X or project Y? Business analysts will leverage the work of data science teams to communicate an answer.


Skills

You can roughly say that data engineers rely most heavily on software engineering skills, data scientists rely on their training in statistics and mathematical modeling, and business analysts rely more heavily on their analytical skills and domain expertise. You can be sure that people who occupy these roles will have varying amounts of skills outside of their specialties.

It’s important to keep this consideration in mind because data science can be a big tent, and you can pick and choose your spots, but each spot comes with different needs, and different salaries.


Salary Ranges

Data scientists need to have the broadest set of skills that covers the theory, implementation and communication of data science. As such, they also tend to be the highest compensated group with an average salary above $115,000 USD.

Data engineers focus on setting up data systems and making sure code is clean, and technical systems are well-suited to the amount of data passing back and forth for analysis. They tend to be middle of the pack when it comes to compensation, with an average salary around $100,000 USD.

Data analysts often focus on querying information and communicating insights found to drive action within organizations. While their average salary is around $65,000 USD, this is partly because a lot of data analyst roles are filled by entry-level graduates with limited work experience.

Together, these roles combine into a whole data science team that can solve any data problem placed in front of it.


Getting Your First Data Science Job

Data Science to the Rescue


You’ve probably heard that being a data scientist is the sexiest career of the 21st century, one where you can earn a healthy salary and enjoy a great work-life balance.

Google’s Chief Economist, Hal Varian, has said that, “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades”.




GiveDirectly, a nonprofit dedicated to solving extreme poverty, needed to determine which villages needed its help the most. But sending people to each village could take several trips at a crushing expense, creating overhead for an organization looking to operate leanly.




Liaising with GiveDirectly, a pair of industry experts from IBM and Enigma set out to see if data science could help.

Using satellite images provided by Google, they were able to use computers to classify which villages had metal roofs on top of their houses, and which ones had thatch. They were able to determine which villages needed the most help without sending a single person to the area.

This required mining satellite imagery and making sense of massive amounts of data, something that would have been impossible a decade ago. It required implementing machine learning algorithms, a cutting-edge technology at the time, to train computers to recognize patterns.

These data scientists were able to pinpoint where GiveDirectly should operate, saving the organization hundreds of man-hours and allowing it to do what it does best: solving extreme poverty.


What is Data Science?
GiveDirectly is just one example of how organizations win by using data to their advantage.

Around the world, organizations are creating more data every day, yet most are struggling to benefit from it. According to McKinsey, the US alone will face a shortage of 150,000+ data analysts and an additional 1.5 million data-savvy managers.

According to LinkedIn, Statistical Analysis & Data Mining were the hottest skills that got recruiters’ attention in 2014. Glassdoor ranked Data Scientist as the #1 job to pursue in 2016. Harvard Business Review even called it the sexiest career of the 21st century.
GiveDirectly was able to save thousands of dollars and put their money where their mission is thanks to a team of three data scientists. Within the mass of data the world generates every day, similar insights are hidden away. Each may have the potential to transform entire industries, or to improve millions of lives.

Salary trends have followed the impact data science drives. With a national average salary of $118k (which increases to $126k in Silicon Valley), data science has become a lucrative career path where you can solve hard problems and drive social impact.

Since you’re reading this guide, you’re likely curious about a career in Data Science, and you’ve probably heard some of these facts and figures. You likely know that data science is a career where you can do good while doing well.

You’re ready to dig beyond the surface, and see real-life examples of data science, and get real-life advice from practitioners in the field.

That’s exactly why we wrote this guide: to bring data science careers to life for thousands of data-curious, savvy young professionals. We hope that after reading this guide, you have a solid understanding of the data science industry, and know what it takes to get your first data science job. We also want to leave you with a checklist of actionable advice which will help you throughout your data science career.


The Foundations of Data Science

DJ Patil, the former Chief Data Scientist of the United States and previously the Head of Data Products at LinkedIn, is credited with coining the term data science.

A decade after it was first used, the term remains contested. There is a lot of debate among practitioners and academics about what data science means, and whether it’s different at all from the data analytics that companies have always done.

One of the most substantive differences is the amount of data you have to process now as opposed to a decade ago. In 2020, the world will generate 50x more data than we generated in 2011. Data science can be considered an interdisciplinary solution to the explosion of data that takes old data analytics approaches, and uses machines to augment and scale their effects on larger data sets.

DJ posits that, “the dominant trait among data scientists is an intense curiosity—a desire to go beneath the surface of a problem, find the questions at its heart, and distill them into a very clear set of hypotheses that can be tested.” There is no mention here of a strict definition of data science, nor of a profile that must fit it.

Baseball players used to be judged by how good scouts thought they looked, not how many times they got on base--that was until the Oakland A’s won an American League record 20 games in a row with one of the lowest-paid rosters in the league. Elections used to swing from party to party with little semblance of predictive accuracy--that was until Nate Silver correctly predicted the winner of all 50 states in the 2012 presidential election.




Data, and a systematic approach to uncover truths about the world around us, have changed the world.

“More than anything, what data scientists do is make discoveries while swimming in data. It’s their preferred method of navigating the world around them,” concludes Patil.

To do data science, you have to be able to find and process large datasets. You’ll often need to understand and use programming, math, and technical communication skills.

Most importantly, you need to have a sense of intellectual curiosity to understand the world through data, and not be deterred easily by obstacles.

You might not think you know anything about data science, but if you’ve ever looked for a Wikipedia table to settle a debate with one of your friends, you were doing a little bit of data science.

Friday, October 16, 2015

PL/SQL Performance Tuning and Optimization


When to Tune PL/SQL Code

The information in this chapter is especially valuable if you are responsible for:


  • Programs that do a lot of mathematical calculations. You will want to investigate the datatypes PLS_INTEGER, BINARY_FLOAT, and BINARY_DOUBLE.
  • Functions that are called from PL/SQL queries, where the functions might be executed millions of times. You will want to look at all performance features to make the function as efficient as possible, and perhaps a function-based index to precompute the results for each row and save on query time.
  • Programs that spend a lot of time processing INSERT, UPDATE, or DELETE statements, or looping through query results. You will want to investigate the FORALL statement for issuing DML, and the BULK COLLECT INTO and RETURNING BULK COLLECT INTO clauses for queries.
  • Older code that does not take advantage of recent PL/SQL language features. (With the many performance improvements in Oracle Database 10g, any code from earlier releases is a candidate for tuning.)
  • Any program that spends a lot of time doing PL/SQL processing, as opposed to issuing DDL statements like CREATE TABLE that are just passed directly to SQL. You will want to investigate native compilation. Because many built-in database features use PL/SQL, you can apply this tuning feature to an entire database to improve performance in many areas, not just your own code.

Before starting any tuning effort, benchmark the current system and measure how long particular subprograms take. PL/SQL in Oracle Database 10g includes many automatic optimizations, so you might see performance improvements without doing any tuning.


Oracle PL/SQL Tuning tips


There are several proven techniques for improving the speed of PL/SQL execution and they are presented below in order of importance:

Use bulk collect: When reading in lots of related rows, bulk collect can run 10x faster than a conventional loop. This speedup is possible because Oracle reduces many context switches to a single operation.

Use forall: When loading a table from an array, the forall operator is 10x faster than a conventional SQL insert statement. Again, this is because of the reduction of context switching.
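
The first two tips combine naturally: BULK COLLECT reads a batch, and FORALL writes it back. Here is a minimal PL/SQL sketch, assuming a hypothetical emp table and an emp_archive table with a matching row structure:

DECLARE
  TYPE emp_tab_t IS TABLE OF emp%ROWTYPE;
  l_emps emp_tab_t;
BEGIN
  -- BULK COLLECT: fetch all matching rows in one context switch
  SELECT * BULK COLLECT INTO l_emps
    FROM emp
   WHERE deptno = 40;

  -- FORALL: send the whole collection to SQL in one context switch
  FORALL i IN 1 .. l_emps.COUNT
    INSERT INTO emp_archive VALUES l_emps(i);

  COMMIT;
END;
/

For very large tables, you would fetch in batches with the LIMIT clause to cap memory use.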

Use SQL analytics: Many advanced data operations can be done without using PL/SQL at all, because they are readily available as SQL built-in functions.

Use implicit cursors: Implicit cursors are faster in PL/SQL than explicitly defined cursors.

Explain Plan

EXPLAIN PLAN parses a query and records the "plan" that Oracle devises to execute it. By examining this plan, you can find out if Oracle is picking the right indexes and joining your tables in the most efficient manner. There are a few different ways to utilize Explain Plan. We will focus on using it through SQL*Plus since most Oracle programmers have access to SQL*Plus.



Creating a Plan Table

The first thing you will need to do is make sure you have a table called PLAN_TABLE available in your schema. The following script will create it for you if you don't have it already:
@?/rdbms/admin/utlxplan.sql

Explain Plan Syntax

EXPLAIN PLAN FOR your-sql-statement;
or
EXPLAIN PLAN SET STATEMENT_ID = 'statement_id' FOR your-sql-statement;

Formatting the output

After running EXPLAIN PLAN, Oracle populates the PLAN_TABLE table with data that needs to be formatted to be presented to the user in a readable form. Several scripts exist for this; however, one of the easiest methods available is to cast dbms_xplan.display to a table and select from it (see the examples below).

Some Examples

SQL> EXPLAIN PLAN FOR select * from dept where deptno = 40;
Explained.
SQL> set linesize 132
SQL> SELECT * FROM TABLE(dbms_xplan.display);
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------
Plan hash value: 2852011669
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 20 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | PK_DEPT | 1 | | 0 (0)| 00:00:01 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DEPTNO"=40)

14 rows selected.

Using SQL*Plus Autotrace

SQL*Plus also offers an AUTOTRACE facility that will display the query plan and execution statistics as each query executes. Example:
SQL> SET AUTOTRACE ON
SQL> select * from dept where deptno = 40;

    DEPTNO DNAME          LOC
---------- -------------- -------------
        40 OPERATIONS     BOSTON

Execution Plan
----------------------------------------------------------
Plan hash value: 2852011669

---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 20 | 1 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)| 00:00:01 |
|* 2 | INDEX UNIQUE SCAN | PK_DEPT | 1 | | 0 (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("DEPTNO"=40)

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
          2 consistent gets
          0 physical reads
          0 redo size
        443 bytes sent via SQL*Net to client
        374 bytes received via SQL*Net from client
          1 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed