Removing duplicates is an essential skill for getting accurate counts, because we often don’t want to count the same thing multiple times. In Python, this can be done with the pandas library, which provides a DataFrame method called drop_duplicates().
Let’s understand how to use it with the help of a few examples.
Let’s say we have a DataFrame that contains vet visits, & the vet’s office wants to know how many dogs of each breed have visited their office. However, some dogs, like Max & Stella, have visited the vet more than once, so they appear in the dataset several times. …
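To make the vet-visit idea concrete, here is a minimal sketch. The vet_visits DataFrame, its column names, & every row in it are made up for illustration; the only thing taken from the text is that Max & Stella show up more than once.

```python
import pandas as pd

# Hypothetical vet-visit records: column names and values are assumptions.
vet_visits = pd.DataFrame({
    "date": ["2018-09-02", "2019-06-07", "2018-01-17", "2019-10-19", "2019-03-30"],
    "name": ["Max", "Max", "Stella", "Stella", "Cooper"],
    "breed": ["Labrador", "Labrador", "Chihuahua", "Chihuahua", "Schnauzer"],
})

# Keep the first row for each (name, breed) pair. Using both columns guards
# against two different dogs that happen to share a name.
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

# Count dogs per breed on the de-duplicated rows.
breed_counts = unique_dogs["breed"].value_counts()
print(breed_counts)
```

Counting on vet_visits directly would report two Labradors & two Chihuahuas, because each repeat visit is a separate row.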
We can write our very own Python functions using the def keyword, function headers, docstrings, & function bodies. However, there’s a quicker way to write functions on the fly, & these are called lambda functions because we use the keyword lambda.
Some function definitions are simple enough that they can be converted to a lambda function. By doing this, we write fewer lines of code, which is pretty awesome & will come in handy, especially when we’re writing & maintaining big programs.
Lambda Function
Here we rewrite our function raise_to_power as a lambda function. After the keyword lambda, we specify the names of the arguments; then, we use a colon followed by the expression that specifies what we wish the function to return. …
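As a rough sketch of that rewrite: the excerpt does not show the original raise_to_power, so the two-argument signature (x, y) below is an assumption.

```python
# A conventional definition with def (signature assumed for illustration).
def raise_to_power(x, y):
    """Return x raised to the power y."""
    return x ** y

# The same behaviour as a lambda: argument names, a colon, then the
# expression whose value is returned.
raise_to_power_lambda = lambda x, y: x ** y

print(raise_to_power(2, 3))
print(raise_to_power_lambda(2, 3))
```

Both calls return the same value. Lambdas are most useful when passed straight to another function; for anything longer than a single expression, def stays the more readable choice.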
If we have a DataFrame & would like to access or select a specific few rows/columns from that DataFrame, we can use square brackets or other advanced methods such as loc & iloc.
Now suppose that we want to select the country column from the brics DataFrame. To achieve this, we will type brics & then the column label inside the square brackets.
Selecting a Column
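Here is a small sketch of square-bracket selection alongside loc & iloc. Only the brics name & its country column come from the text; the capital column, the row labels, & the values are assumptions added so the example runs on its own.

```python
import pandas as pd

# Illustrative data: the index labels and the capital column are assumptions.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Single square brackets with a column label return a pandas Series.
country_series = brics["country"]

# Double square brackets return a one-column DataFrame instead.
country_frame = brics[["country"]]

# loc selects by row and column labels; iloc selects by integer position.
india_by_label = brics.loc["IN", "country"]
india_by_position = brics.iloc[2, 0]

print(type(country_series).__name__, type(country_frame).__name__)
print(india_by_label, india_by_position)
```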
We’ll learn techniques for cleaning messy data in SQL, which is a must-have skill for any Data Scientist.
Real-world data is almost always messy. As a data scientist, a data analyst, or even a developer, if you need to discover facts about data, it’s vital to ensure that the data is tidy enough for that.
In this tutorial, we will practice some of the most common data cleaning techniques in SQL. We will create our own dummy dataset, but the techniques can be applied to real-world tabular data as well. …
Learn how to use aggregate functions for summarizing results & gaining useful insights about data in SQL.
Building reports from a given dataset is an essential skill if you are working with data, because ultimately you want to be able to answer critical business questions using the data at your disposal. Many times, these answers are presented in the form of charts. But sometimes, reports in the form of tables are also needed. In both cases, you might need to summarize the data using simple calculations. In SQL, you can summarize the data using aggregate functions. …
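To give a feel for what summarizing with aggregate functions looks like, here is a minimal sketch. It runs the SQL through Python’s built-in sqlite3 module against an in-memory SQLite database, which is a stand-in chosen so the example is self-contained; the sales table, its columns, & its rows are all hypothetical.

```python
import sqlite3

# In-memory SQLite database; table name, columns, and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("East", 100), ("East", 200), ("West", 50)],
)

# COUNT, SUM, and AVG each collapse the rows of a group into one value;
# GROUP BY produces one summary row per region.
summary = conn.execute(
    """
    SELECT region, COUNT(*), SUM(amount), AVG(amount)
    FROM sales
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for row in summary:
    print(row)
```

Each output row is the kind of line a tabular report would contain: one region with its order count, total, & average.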
This blog is simply a walkthrough of my Coursera Project, “Battle of the Neighborhoods”.
*Disclaimer:
This project was done in the midst of the pandemic so some of the data may be slightly outdated & it’s also not the best time to start a business up.
New York City will always be my favorite city in the world due to its melting pot of diverse cultures & ethnicities, which brings various innovative ideas & technologies to the modern world. Exploring NYC data was entertaining, to say the least. New York City is one of the most populous cities in the United States, with a population of 8.39 million in 2020. It is a hub of diverse cultures, as I stated earlier, bringing together people from all over the globe. NYC is a major industrial center & the financial capital of the world. There are five boroughs in the city, & with such a large geographical area, there is intense competition between companies. Therefore, it is a big challenge to figure out the ideal spots to open a new business & maximize profits. …
SQL Server is a relational database management system. One of the key principles of the relational database is that data is stored across multiple tables.
We will need to be able to join tables together in order to extract the data we need. We use primary and foreign keys to join tables.
A primary key is a column that is used to uniquely identify each row in a table. This uniqueness can be achieved by using a sequential integer as an identity column. Alternatively, an existing column that naturally contains unique values can be used.
In the example below, we can see the first few rows from the artist table. It has two columns, artist_id & name. The artist_id column acts as a primary key for this table: it is an integer column, & each value is different. …
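Here is a rough sketch of how a primary key & a foreign key work together in a join. It uses Python’s built-in sqlite3 module with an in-memory SQLite database rather than SQL Server, so the syntax is SQLite’s. Only the artist table with its artist_id & name columns comes from the text; the album table, its columns, & all sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# artist_id is the primary key: an integer that is unique for every row.
conn.execute(
    "CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)

# Hypothetical album table: its artist_id column is a foreign key that
# points back at artist.artist_id.
conn.execute(
    """
    CREATE TABLE album (
        album_id  INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        artist_id INTEGER NOT NULL REFERENCES artist (artist_id)
    )
    """
)

# Sample rows, chosen only for illustration.
conn.executemany("INSERT INTO artist VALUES (?, ?)", [(1, "AC/DC"), (2, "Accept")])
conn.executemany(
    "INSERT INTO album VALUES (?, ?, ?)",
    [
        (1, "For Those About To Rock We Salute You", 1),
        (2, "Balls to the Wall", 2),
        (3, "Let There Be Rock", 1),
    ],
)

# Match each album's foreign key to the artist's primary key.
rows = conn.execute(
    """
    SELECT album.title, artist.name
    FROM album
    INNER JOIN artist ON album.artist_id = artist.artist_id
    ORDER BY album.album_id
    """
).fetchall()
conn.close()

for title, name in rows:
    print(title, "|", name)
```

Because artist_id is unique in artist, every album row matches exactly one artist row.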
SQL (Structured Query Language) is the native language for interacting with databases & is designed for exactly this purpose. A database models real-life entities like professors & universities by storing them in tables. Each table contains data from a single entity type. This reduces redundancy by storing entities only once. For example, there only needs to be one row of data containing a certain company’s details. A database can also be used to model the relationships between entities.
While SQL can be used to create & modify databases, this tutorial’s focus will be on querying databases. A query is a request for data from a database table (or combination of tables). Querying is an essential skill for a data scientist since the data you need for your analyses will often live in databases. …
If you’re familiar with Python or any other programming language, you’ll undoubtedly know that variables need to be defined before they can be used in your program. We’ll start off with variable initialization. Then, we’ll get familiar with the boundary of variables within a program (its “scope”). We’ll learn about the four different scopes with the help of examples: Local, Enclosing, Global, & Built-in. These scopes together form the basis for the LEGB rule used by the Python interpreter when working with variables. …
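The four scopes can be sketched in a few lines. All the names below (label, outer, & so on) are made up for illustration; the point is only the order in which Python looks a name up: Local, then Enclosing, then Global, then Built-in.

```python
label = "global"  # Global scope: defined at module level


def outer():
    label = "enclosing"  # Enclosing scope, as seen from the nested functions

    def inner_with_local():
        label = "local"  # Local scope: found first, so the search stops here
        return label

    def inner_without_local():
        # No local binding, so Python falls back to the enclosing function.
        return label

    return inner_with_local(), inner_without_local()


def read_global():
    # No local or enclosing binding, so the module-level name is used.
    return label


def use_builtin(values):
    # max is not defined in this module, so it resolves to the built-in.
    return max(values)


local_value, enclosing_value = outer()
print(local_value, enclosing_value, read_global(), use_builtin([3, 7, 5]))
```

Note that assigning to label inside a function creates a new local name; the module-level label is left unchanged.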
Finding interesting bits of data in a DataFrame is often easier if you change the rows’ order. You can sort the rows by passing a column name to .sort_values().
In cases where rows have the same value (this is more common if you sort on a categorical variable), you may want to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
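Both kinds of sort can be sketched as follows. The dogs DataFrame, its column names, & its values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical data: names, breeds, and weights are illustrative only.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "weight_kg": [25, 23, 29, 17],
})

# Sort on a single column (ascending by default).
by_weight = dogs.sort_values("weight_kg")

# Sort on breed first, then break ties within each breed by weight,
# heaviest first. The ascending list lines up with the column list.
by_breed_then_weight = dogs.sort_values(
    ["breed", "weight_kg"], ascending=[True, False]
)

print(by_weight)
print(by_breed_then_weight)
```

sort_values() returns a new DataFrame & leaves dogs itself in its original order.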