SQL Server is a relational database management system. One of the key principles of the relational database is that data is stored across multiple tables.


We will need to be able to join tables together in order to extract the data we need. We use primary and foreign keys to join tables.

Primary Key

A primary key is a column that is used to uniquely identify each row in a table. This uniqueness is often achieved by using a sequential integer as an identity column. Sometimes, an existing column naturally contains unique values & can be used instead.

In the below example, we can see the first few rows from the artist table. It has two columns, artist_id & name. The artist_id column acts as the primary key for this table: it is an integer column, & each value is different. …
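A minimal sketch of this idea, using Python's built-in sqlite3 module (the table & column names follow the example above; the sample artist names are illustrative assumptions):

```python
import sqlite3

# In-memory database with an artist table whose artist_id is the primary key
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO artist (artist_id, name) VALUES (?, ?)",
    [(1, "AC/DC"), (2, "Accept"), (3, "Aerosmith")],
)

# The primary key enforces uniqueness: a duplicate artist_id is rejected
try:
    conn.execute("INSERT INTO artist VALUES (1, 'Duplicate')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

The failed insert never reaches the table, so each artist_id still appears exactly once.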


(Table: the first few rows of the artist table, showing the artist_id & name columns)

SQL (Structured Query Language) is the native language for interacting with databases & is designed for exactly this purpose. A database models real-life entities, like professors & universities, by storing them in tables. Each table contains data for a single entity type. This reduces redundancy by storing each entity only once. For example, there only needs to be one row of data containing a certain company’s details. Lastly, a database can be used to model the relationships between entities.

Querying Databases

While SQL can be used to create & modify databases, this tutorial’s focus will be on querying databases. A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist since the data you need for your analyses will often live in databases. …
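As a sketch of what a query looks like in practice — again via sqlite3, with hypothetical artist & album tables joined through a primary/foreign key pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album  (album_id  INTEGER PRIMARY KEY, title TEXT,
                         artist_id INTEGER REFERENCES artist(artist_id));
    INSERT INTO artist VALUES (1, 'Aerosmith');
    INSERT INTO album  VALUES (1, 'Toys in the Attic', 1);
""")

# A query: request data from a combination of tables via a JOIN
rows = conn.execute("""
    SELECT artist.name, album.title
    FROM album
    JOIN artist ON artist.artist_id = album.artist_id
""").fetchall()
print(rows)  # [('Aerosmith', 'Toys in the Attic')]
```

The JOIN matches each album row to its artist row through the shared artist_id key.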



If you’re familiar with Python or any other programming language, you’ll undoubtedly know that variables need to be defined before they can be used in your program. We’ll start off with variable initialization. Then, we’ll get familiar with the boundaries of variables within a program (their “scope”). We’ll learn about the four different scopes with the help of examples: Local, Enclosing, Global, & Built-in. Together, these scopes form the basis for the LEGB rule used by the Python interpreter when resolving variables. …
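A compact illustration of the four LEGB scopes (the variable names here are purely for demonstration):

```python
x = "global"           # Global scope: defined at module level

def outer():
    x = "enclosing"    # Enclosing scope: local to outer, visible to inner
    def inner():
        x = "local"    # Local scope: shadows the enclosing & global x
        return x
    return inner(), x

print(outer())  # ('local', 'enclosing')
print(x)        # 'global' — the outer assignments never touched it
print(len)      # Built-in scope: len resolves without any definition here
```

The interpreter searches Local, then Enclosing, then Global, then Built-in, stopping at the first match.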



Finding interesting bits of data in a DataFrame is often easier if you change the rows’ order. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is more common if you sort on a categorical variable), you may want to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
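A short sketch of both calls, using a small made-up DataFrame (the column names are illustrative assumptions):

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed":  ["Labrador", "Poodle", "Beagle", "Labrador"],
    "height": [56, 43, 38, 59],
})

# Sort by a single column
print(dogs.sort_values("height"))

# Sort on multiple columns to break ties: breed ascending,
# then height descending within each breed
print(dogs.sort_values(["breed", "height"], ascending=[True, False]))
```

Passing a list to `ascending` lets each column get its own sort direction.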



One alternative to using a loop to iterate over a DataFrame is to use the pandas .apply() method. It behaves much like Python’s built-in map() function: it takes a function as input & applies it along an axis of the DataFrame.

If you are working with tabular data, you must specify the axis you want your function to act on: axis=0 applies the function to each column, & axis=1 applies it to each row.

Much like the map() function, the .apply() method can also be used with anonymous (lambda) functions. Let’s look at some .apply() examples using baseball data.

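A sketch along those lines — the baseball columns (hits, at_bats) are hypothetical stand-ins, not the article's actual dataset:

```python
import pandas as pd

df = pd.DataFrame({"hits": [150, 172, 199], "at_bats": [500, 540, 610]})

# axis=0: the lambda receives each COLUMN as a Series
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the lambda receives each ROW, here computing a batting average
df["avg"] = df.apply(lambda row: row["hits"] / row["at_bats"], axis=1)

print(col_ranges)
print(df)
```

Note that for simple row arithmetic like this, the vectorized form `df["hits"] / df["at_bats"]` is faster; .apply() earns its keep when the per-row logic is more involved.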


If you are looking to find or replace items in a string, Python has several built-in methods that can help you search a target string for a specified substring.

.find() Method

Syntax: string.find(substring, start, end)

Note: start and end are optional arguments.

From the above syntax, we can see that the .find() method takes the desired substring as its mandatory argument. The other two arguments are optional: an inclusive starting position & an exclusive ending position.

Substring Search

In the example code, you search for Waldo in the string Where's Waldo?. The .find() method returns the index of the first occurrence of the substring, or -1 if the substring is not found.
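The search, sketched with all three argument variants:

```python
s = "Where's Waldo?"

print(s.find("Waldo"))        # 8 — index of the first occurrence
print(s.find("Wenda"))        # -1 — substring not present
print(s.find("Waldo", 0, 6))  # -1 — search limited to indices [0, 6)
```

Restricting the range with start & end makes the method miss "Waldo", which begins at index 8.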



An object-oriented approach is most useful when your code involves complex interactions of many objects. In real production code, classes can have dozens of attributes & methods with complicated logic, but the underlying structure is the same as in the most simple class.

Classes are like a blueprint for objects, outlining the possible behaviors & states that every object of a certain type could have. For example, if you say, “Every customer will have a phone number & an e-mail, & will be able to place and cancel orders”, you have just defined a class. This way, you can talk about customers in a unified way. …
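The customer blueprint from that sentence, sketched as a class (all names & values are illustrative):

```python
class Customer:
    """Every customer has a phone & an e-mail (state),
    and can place or cancel orders (behavior)."""

    def __init__(self, phone, email):
        self.phone = phone        # attributes hold the object's state
        self.email = email
        self.orders = []

    def place_order(self, item):  # methods define the object's behavior
        self.orders.append(item)

    def cancel_order(self, item):
        self.orders.remove(item)

c = Customer("555-0100", "ada@example.com")
c.place_order("laptop")
c.cancel_order("laptop")
print(c.orders)  # []
```

Every Customer instance gets the same attributes & methods, which is exactly the "unified way" of talking about customers the text describes.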



Classification techniques are an essential part of machine learning & data mining applications. Approximately 70% of problems in data science are classification problems. There are many kinds of classification problems, but logistic regression is a common & useful method for solving binary classification problems. Another category is multinomial classification, which handles problems where the target variable contains multiple classes. For example, the IRIS dataset is a very famous example of multi-class classification. Other examples include classifying articles, blogs, & documents.

Logistic Regression is one of the simplest & most commonly used machine learning algorithms for two-class classification. It’s easy to implement & can be used as the baseline for any binary classification problem, & its fundamental concepts are also constructive in deep learning. Logistic regression describes & estimates the relationship between one dependent binary variable & one or more independent variables. It’s a statistical method for predicting binary classes: the outcome or target variable is dichotomous in nature, which means that there are only two possible classes. For example, it can be used for cancer detection problems. It computes the probability of an event occurrence. …
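A minimal baseline sketch with scikit-learn, using its bundled breast-cancer dataset as the cancer-detection example mentioned above (the train/test split & solver settings are just reasonable defaults, not the article's setup):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary target: malignant vs benign tumours
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Accuracy on held-out data, plus the per-class probabilities
# that logistic regression computes for each sample
print(model.score(X_test, y_test))
print(model.predict_proba(X_test[:1]))
```

predict_proba exposes the event probabilities directly, which is what makes logistic regression so useful as an interpretable baseline.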



Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It tries to preserve the essential parts that carry more of the data’s variation & remove the non-essential parts with less variation.

Dimensions are nothing but features that represent the data. For example, a 28 X 28 image has 784 picture elements (pixels) that are the dimensions or features which together represent that image.

One important thing to note about PCA is that it is an unsupervised dimensionality reduction technique: you can group similar data points based on the feature correlation between them without any supervision (or labels). …
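A small sketch with scikit-learn's PCA on synthetic data built so that nearly all of the variation lies in a 2-D sub-space of a 5-D feature space (the data-generation recipe is my own illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 samples in 5 dimensions whose variance comes from 2 latent directions,
# plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5)) \
    + 0.01 * rng.normal(size=(100, 5))

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)   # project onto the 2-D sub-space

print(X_low.shape)                           # (100, 2)
print(pca.explained_variance_ratio_.sum())   # close to 1.0
```

Note that no labels are passed to fit_transform — PCA finds the high-variance directions from the features alone, which is what makes it unsupervised.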



Imagine you create a classification model & right off the bat you receive 90% accuracy. These results seem amazing, but when you dive a little deeper into your data, you notice that almost the entirety of it belongs to one class. Imbalanced data can cause a lot of frustration: the great results you thought you were getting turn out to be misleading.
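The 90%-accuracy trap can be reproduced in a few lines: with a 90/10 class split, a "model" that always predicts the majority class scores 90% accuracy while detecting zero minority cases (the 90/10 split here is a made-up illustration):

```python
import numpy as np

# 90 samples of class 0, 10 of class 1 — an imbalanced dataset
y_true = np.array([0] * 90 + [1] * 10)

# A useless classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.9 — looks great, but it's a lie

# Recall on the minority class exposes the problem
recall_class_1 = (y_pred[y_true == 1] == 1).mean()
print(recall_class_1)  # 0.0 — class 1 is never detected
```

This is why metrics like per-class recall, precision, or the F1 score matter more than raw accuracy on imbalanced data.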

About

Jason Joseph

Data Scientist & Machine Learning Engineer
