Removing duplicates is an essential skill for getting accurate counts, because we often don’t want to count the same thing multiple times. In Python, we can do this with the pandas library, whose DataFrames have a method called drop_duplicates.

Let’s understand how to use it with the help of a few examples.

Dropping Duplicate Names

Let’s say we have a DataFrame that contains vet visits, & the vet’s office wants to know how many dogs of each breed have visited. However, some dogs, like Max & Stella, have visited the vet more than once & so appear in the dataset several times. …
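
As a rough sketch of the idea, the snippet below builds a small made-up table of vet visits (the column names, dates & breeds are assumptions, not the article’s data), drops repeat visits with drop_duplicates, & then counts dogs per breed.

```python
import pandas as pd

# Hypothetical vet-visit records; names, breeds and dates are invented for illustration.
vet_visits = pd.DataFrame({
    "date": ["2020-01-05", "2020-02-11", "2020-02-20", "2020-03-02", "2020-03-15"],
    "name": ["Max", "Stella", "Max", "Bella", "Stella"],
    "breed": ["Labrador", "Chihuahua", "Labrador", "Labrador", "Chihuahua"],
})

# Keep only the first row for each (name, breed) pair, so each dog appears once.
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

# Count dogs per breed without double counting repeat visitors.
breed_counts = unique_dogs["breed"].value_counts()
print(breed_counts)
```

Passing both name & breed to subset is a design choice: two different dogs can share a name, so the pair is a safer stand-in for a single dog than the name alone.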

We can write our very own Python functions using the def keyword, function headers, docstrings, & function bodies. However, there’s a quicker way to write functions on the fly, & these are called lambda functions because we use the keyword lambda.

Some function definitions are simple enough that they can be converted to a lambda function. By doing this, we write fewer lines of code, which is pretty awesome & will come in handy, especially when we’re writing & maintaining big programs.

Lambda Function

Here we rewrite our function raise_to_power as a lambda function. After the keyword lambda, we specify the names of the arguments; then, we use a colon followed by the expression that specifies what we wish the function to return. …
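
A minimal sketch of that rewrite might look like this. The argument names x & y are assumptions, since the original definition of raise_to_power is not shown here.

```python
# Regular definition (assumed signature: base x, exponent y).
def raise_to_power(x, y):
    """Return x raised to the power y."""
    return x ** y

# Equivalent lambda: argument names, then a colon, then the expression to return.
raise_to_power_lambda = lambda x, y: x ** y

print(raise_to_power(2, 3))
print(raise_to_power_lambda(2, 3))
```

Both calls return the same value; the lambda simply drops the def line, the docstring & the explicit return statement.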

If we have a DataFrame & would like to select specific rows or columns from it, we can use square brackets or more advanced methods such as loc & iloc.

Selecting Columns Using Square Brackets

Now suppose that we want to select the country column from the brics DataFrame. To achieve this, we will type brics & then the column label inside the square brackets.

Selecting a Column
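
The snippet below is a small sketch of this selection. The contents of brics are assumed for illustration (only the country column is mentioned above; the capital column & the row labels are hypothetical).

```python
import pandas as pd

# Illustrative stand-in for the brics DataFrame; columns and index labels are assumptions.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Single square brackets with a column label return a pandas Series.
country_series = brics["country"]

# Double square brackets (a list of labels) return a DataFrame instead.
country_frame = brics[["country"]]

print(type(country_series))
print(type(country_frame))
```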

We’ll learn techniques for cleaning messy data in SQL, which is a must-have skill for any Data Scientist.

Real-world data is almost always messy. As a data scientist, a data analyst or even a developer, if you need to discover facts about data, it’s vital to ensure that the data is tidy enough to do so.

In this tutorial, we will practice some of the most common data cleaning techniques in SQL. We will create our own dummy dataset, but the techniques can be applied to real-world tabular data as well. …

Learn how to use aggregate functions for summarizing results & gaining useful insights about data in SQL.

Building reports from a given dataset is an essential skill if you are working with data, because ultimately you want to be able to answer critical business questions using the data at your disposal. Many times, these answers are presented in the form of report charts, but sometimes reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize the data using aggregate functions. …
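
To give a feel for aggregate functions, here is a small sketch that runs a GROUP BY query from Python. It uses the standard-library sqlite3 module as a stand-in database, & the sales table with its rows is entirely hypothetical.

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 30.0)],
)

# COUNT, SUM and AVG collapse each region's rows into a single summary row.
summary = conn.execute(
    """
    SELECT region, COUNT(*), SUM(amount), AVG(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

print(summary)
```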

This blog is simply a walkthrough of my Coursera Project, “Battle of the Neighborhoods”.


This project was done in the midst of the pandemic, so some of the data may be slightly outdated, & it’s also not the best time to start up a business.

New York City will always be my favorite city in the world due to its melting pot of diverse cultures & ethnicities, which brings various innovative ideas & technologies to the modern world. Exploring NYC data was entertaining, to say the least. New York City is one of the most populous cities in the United States, with a population of 8.39 million in 2020. It is a hub of diverse cultures, as I stated earlier, combining all facets of the globe. NYC is a major industrial center & the financial capital of the world. There are five boroughs in the city, & with such a large geographical area, there is huge competition between companies. Therefore, it is a big challenge to figure out the ideal spots to open a new business & maximize profits. …

SQL Server is a relational database management system. One of the key principles of the relational database is that data is stored across multiple tables.

We will need to be able to join tables together in order to extract the data we need. We use primary and foreign keys to join tables.

Primary Key

A primary key is a column that is used to uniquely identify each row in a table. This uniqueness can be achieved by using a sequential integer as an identity column, or sometimes existing columns naturally contain unique values & can be used instead.

In the example below, we can see the first few rows from the artist table. It has two columns, artist_id & name. The artist_id column acts as the primary key for this table: it is an integer column, & each value is different. …
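
The following is a hedged sketch of primary & foreign keys in action. It uses Python’s built-in sqlite3 module rather than SQL Server, & apart from the artist table’s two columns, everything here (the album table, the names & the titles) is a hypothetical example.

```python
import sqlite3

# In-memory SQLite database as a stand-in; SQL Server syntax differs in places.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE album ("
    "album_id INTEGER PRIMARY KEY, "
    "title TEXT, "
    "artist_id INTEGER REFERENCES artist(artist_id))"  # foreign key to artist
)
conn.executemany("INSERT INTO artist VALUES (?, ?)", [(1, "Artist A"), (2, "Artist B")])
conn.executemany(
    "INSERT INTO album VALUES (?, ?, ?)",
    [(10, "First Album", 1), (11, "Second Album", 1), (12, "Debut", 2)],
)

# The primary key rejects a second row that reuses an existing artist_id.
try:
    conn.execute("INSERT INTO artist VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Join the tables by matching the foreign key to the primary key.
rows = conn.execute(
    "SELECT artist.name, album.title "
    "FROM album JOIN artist ON album.artist_id = artist.artist_id "
    "ORDER BY album.album_id"
).fetchall()
conn.close()

print(duplicate_rejected)
print(rows)
```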

SQL (Structured Query Language) is the native language for interacting with databases & is designed for exactly this purpose. A database models real-life entities like professors & universities by storing them in tables. Each table contains data from a single entity type. This reduces redundancy by storing entities only once. For example, there only needs to be one row of data containing a certain company’s details. Lastly, a database can be used to model the relationships between entities.

Querying Databases

While SQL can be used to create & modify databases, this tutorial’s focus will be on querying databases. A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist since the data you need for your analyses will often live in databases. …

If you’re familiar with Python or any other programming language, you’ll undoubtedly know that variables need to be defined before they can be used in your program. We’ll start off with variable initialization. Then, we’ll get familiar with the boundary of a variable within a program (its “scope”). We’ll learn about the four different scopes with the help of examples: Local, Enclosing, Global, & Built-in. These scopes together form the basis for the LEGB rule used by the Python interpreter when working with variables. …
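
As a quick illustration of the LEGB lookup order, here is a minimal sketch. All function & variable names are made up for this example.

```python
x = "global"  # Global scope: defined at module level


def outer():
    x = "enclosing"  # Enclosing scope: local to outer, visible to nested functions

    def inner():
        x = "local"  # Local scope: shadows the enclosing and global x
        return x

    def inner_without_local():
        return x  # No local x here, so Python finds the enclosing one

    return inner(), inner_without_local()


def read_global():
    return x  # No local or enclosing x, so Python falls back to the global one


local_value, enclosing_value = outer()
global_value = read_global()
builtin_value = len("abc")  # len is not defined here; it is found in the built-in scope

print(local_value, enclosing_value, global_value, builtin_value)
```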

Finding interesting bits of data in a DataFrame is often easier if you change the rows’ order. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is more common if you sort on a categorical variable), you may want to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
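
A short sketch of both cases is shown below, using a small invented dogs DataFrame (the column names & values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data; two dogs share the same weight so there is a tie to break.
dogs = pd.DataFrame({
    "name": ["Max", "Stella", "Bella", "Charlie"],
    "breed": ["Labrador", "Chihuahua", "Labrador", "Poodle"],
    "weight_kg": [29, 2, 24, 24],
})

# Sort on a single column (ascending by default).
by_weight = dogs.sort_values("weight_kg")

# Sort on weight first, then break ties alphabetically by name.
by_weight_then_name = dogs.sort_values(["weight_kg", "name"])

print(by_weight_then_name)
```

Note that .sort_values() returns a new DataFrame & leaves dogs itself unchanged.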


Jason Joseph

Data Scientist & Machine Learning Engineer
