Hands-on Tutorials

Streaming and Refreshing

Photo by Sasha • Stories on Unsplash

Data scientists use data visualization to communicate data and generate insights. It’s essential for data scientists to know how to create meaningful visualization dashboards, especially real-time dashboards. This article covers two ways to build a real-time dashboard in Python: streaming and refreshing.

For demonstration purposes, the plots and dashboards are very basic, but they should give you the idea of how to build a real-time dashboard.

The code for this article can be found…

Math and gradient descent implementation in Python

Photo by Amy Shamblen on Unsplash

Multiclass logistic regression is also called multinomial logistic regression and softmax regression. It is used when we want to predict one of more than two classes. A lot of people use multiclass logistic regression all the time but don’t really know how it works. So, I am going to walk you through the math and implement it from scratch in Python using gradient descent.

Disclaimer: there are various notations on this topic. I am using the notation that I think is easy to understand and visualize. …
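To make the idea concrete, here is a minimal sketch of softmax regression trained with full-batch gradient descent, written with the standard library only. The function names, the toy data, the learning rate, and the number of steps are all assumptions made for illustration; this is not the notation or the implementation from the article itself.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(x, W, b):
    """Class probabilities for one sample x; W[k] is the weight vector of class k."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(scores)

def cross_entropy(X, y, W, b):
    """Average negative log-probability assigned to the true classes."""
    return -sum(math.log(predict_proba(x, W, b)[label]) for x, label in zip(X, y)) / len(X)

def gradient_step(X, y, W, b, lr):
    """One full-batch gradient descent update of W and b, in place."""
    n, n_classes, n_features = len(X), len(W), len(X[0])
    grad_W = [[0.0] * n_features for _ in range(n_classes)]
    grad_b = [0.0] * n_classes
    for x, label in zip(X, y):
        p = predict_proba(x, W, b)
        for k in range(n_classes):
            # Gradient of the loss with respect to score k: p_k - 1[k == label]
            err = p[k] - (1.0 if k == label else 0.0)
            grad_b[k] += err / n
            for j in range(n_features):
                grad_W[k][j] += err * x[j] / n
    for k in range(n_classes):
        b[k] -= lr * grad_b[k]
        for j in range(n_features):
            W[k][j] -= lr * grad_W[k][j]

# Made-up toy data: three classes, two features
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
y = [0, 0, 1, 1, 2, 2]
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]

initial_loss = cross_entropy(X, y, W, b)
for _ in range(500):
    gradient_step(X, y, W, b, lr=0.5)
final_loss = cross_entropy(X, y, W, b)

# Predicted class = index of the largest probability
predictions = [max(range(3), key=lambda k: predict_proba(x, W, b)[k]) for x in X]
print(initial_loss, final_loss, predictions)
```

With all weights at zero every class gets probability 1/3, so the starting loss equals log(3); gradient descent then lowers it. A vectorized version would normally use NumPy, but the plain-Python loops keep each term of the gradient visible.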

Using pytest and hypothesis for unit testing

Photo by Sarah Kilian on Unsplash

Software testing is essential for software development. Software engineers are often advised to use test-driven development (TDD), a process in which test cases are written first and the software is developed afterward. For data scientists, it is not always easy or feasible to write tests first. Nevertheless, testing matters: every data scientist should know how to write unit tests and use them in their data science workflow. A lot of data scientists already use assertions, which is a very important first step toward test-driven development. This article will step up from assertions and focus…
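As a rough sketch of what moving from bare assertions to unit tests can look like, the block below defines a small function and two test functions. pytest collects functions whose names start with `test_`, so saving this in a file such as `test_normalize.py` (a hypothetical name) and running `pytest` would execute both. The `normalize` function is made up for illustration. The second test is written in a property-based style, and the standard-library `random` module stands in for the inputs that hypothesis would generate through its `@given` decorator.

```python
import random

def normalize(values):
    """Scale a list of numbers to the range [0, 1] (hypothetical function under test)."""
    low, high = min(values), max(values)
    if low == high:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def test_normalize_known_example():
    # Example-based test: one hand-picked input with a known answer
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]

def test_normalize_stays_in_unit_interval():
    # Property-style test: the output must lie in [0, 1] for any input.
    # random is a stdlib stand-in for the inputs hypothesis would generate.
    rng = random.Random(0)
    for _ in range(200):
        values = [rng.uniform(-1e6, 1e6) for _ in range(rng.randint(1, 20))]
        result = normalize(values)
        assert len(result) == len(values)
        assert all(0.0 <= r <= 1.0 for r in result)
```

The example-based test documents one expected result, while the property-style test checks a rule that should hold for every input, which is the kind of check hypothesis is designed to automate.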

setup, debug, version control, and deployment

Photo by Greg Rakozy on Unsplash

Many data scientists like to use Jupyter Notebook or JupyterLab to do their data explorations, visualizations, and model building. I know some data scientists refuse to use Jupyter Notebook. But, I love to use Jupyter Notebook/Lab to do my experiments and explorations. Here is a Jupyter Notebook workflow that might be helpful.


Setup environment

Whenever you start a new project, you should always begin with a fresh environment. If you are lazy and don’t want to create a new environment for every project, you should at least create one new environment that’s separate from your base environment…

cProfile and line_profiler

Profiling helps us identify bottlenecks and optimize performance in our code. If our code runs slowly, we can identify where our code is slow and then make corresponding improvements.

Here are the explanations of Python profilers from the Python documentation:

cProfile and profile provide deterministic profiling of Python programs. A profile is a set of statistics that describes how often and for how long various parts of the program executed…cProfile is recommended for most users; it’s a C extension with reasonable overhead that makes it suitable for profiling long-running programs.

cProfile is a Python built-in profiler, which is great for…
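Here is a minimal sketch of profiling a piece of code with cProfile and formatting the result with pstats, both from the standard library. The functions `slow_sum_of_squares` and `workload` are made up to give the profiler something to measure.

```python
import cProfile
import io
import pstats

def slow_sum_of_squares(n):
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def workload():
    """Call the slow function several times."""
    return [slow_sum_of_squares(20_000) for _ in range(20)]

# Collect statistics only for the code between enable() and disable()
profiler = cProfile.Profile()
profiler.enable()
results = workload()
profiler.disable()

# Write the report to a string, sorted by cumulative time, top 5 entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

For a whole script, running `python -m cProfile -s cumulative script.py` from the command line gives a similar report without changing the code.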

Github 101 for team projects

Photo by ali nafezarefi on Unsplash

If you are working by yourself, then git clone, git status, git add, git commit, git push would probably be sufficient for your work. However, if you ever work on a team project with other data scientists and software engineers, it is better to use forks and branches. Here is the git workflow for you if you are on a team project:

Step 1: git clone

The first step is to git clone your team’s project to your local machine, and then get in the project folder:

git clone https://github.com/your-team/team-project.git
cd team-project

Step 2: Fork the team project you are going to work on

It’s better if you commit your changes to a local branch and…

Using simple_salesforce Python API


Salesforce is probably the most annoying database I have worked with. This web page illustrates the relationships among the objects (i.e., data tables) stored in Salesforce. You might think it doesn’t look that bad. Well, in reality, the relationships among your tables can be a lot messier than this illustration. Let’s work through this and see how we can query Salesforce data while remaining sane.


Where can we find what tables and variables are available in Salesforce? To get an overview of all the tables (objects) and variables (entities), we can go to Salesforce `Developer Console > File > Open…

How to approach a text classification problem

Source: Unsplash

Imagine we have a large number of text files and we need to classify them into different topics. What should we do? This article gives an overview of text classification and how I would approach this problem at a high level. I would like to address it in three steps: data preparation and exploration, labeling, and modeling.

Data Preparation and Data Exploration

The first step is data preparation and exploration. I will transform our text data into a matrix representation through different word embedding methods. …
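To illustrate what a matrix representation of text looks like, here is a minimal sketch of the simplest one, a bag-of-words count matrix, using only the standard library. This is a plain count representation rather than a learned word embedding, and the documents, function names, and tokenization rule are assumptions made for the example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()

def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Made-up documents for illustration
documents = [
    "The market rallied as tech stocks rose.",
    "The team won the final match.",
]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one word in the vocabulary, so the matrix can be passed to a classifier once labels are available.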

MRR and Churn calculations

source: https://unsplash.com/photos/ZVprbBmT8QA

Stripe is an online payment company that offers software and APIs for processing payments and business management. I love that Stripe has different APIs for different languages, which makes people’s lives a lot easier.

I primarily use the Stripe Python API. To install:

pip install --upgrade stripe

You can also run conda install stripe, but the most recent version of the library doesn’t seem to be available on Conda yet. The version I am using is Stripe 2.55.0.

Next, you will need an API key to access the Stripe API. Go to stripe.com …
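Before touching the API, it can help to see the two calculations on their own. The sketch below uses common definitions that are assumed here rather than taken from the article: MRR is the sum of the per-month amounts of active subscriptions, and churn rate is the number of customers lost in a period divided by the number of customers at the start of it. The records are hypothetical plain dictionaries, not the shape of real Stripe API responses, and no Stripe call is made.

```python
def monthly_amount(subscription):
    """Convert one subscription's price to a per-month amount in dollars."""
    amount = subscription["amount_cents"] / 100
    if subscription["interval"] == "year":
        return amount / 12  # spread a yearly price over 12 months
    return amount  # anything else is assumed to be billed monthly

def mrr(subscriptions):
    """Monthly recurring revenue: summed monthly amounts of active subscriptions."""
    return sum(monthly_amount(s) for s in subscriptions if s["status"] == "active")

def churn_rate(customers_at_start, customers_lost):
    """Share of the customers present at the start of the period that were lost."""
    if customers_at_start == 0:
        return 0.0
    return customers_lost / customers_at_start

# Hypothetical records for illustration only
subscriptions = [
    {"status": "active", "amount_cents": 2000, "interval": "month"},
    {"status": "active", "amount_cents": 12000, "interval": "year"},
    {"status": "canceled", "amount_cents": 5000, "interval": "month"},
]

print(mrr(subscriptions))
print(churn_rate(customers_at_start=200, customers_lost=10))
```

Definitions of MRR and churn vary between companies (for example, in how discounts or trials are treated), so these functions are a starting point to adapt.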

Getting Started

Bigram/trigram, sentiment analysis, and topic modeling

Source: https://unsplash.com/photos/uGP_6CAD-14

This article talks about the most basic text analysis tools in Python. We are not going into the fancy NLP models. Just the basics. Sometimes all you need is the basics :)

Let’s first get some text data. Here we have a list of course reviews that I made up. What can we do with this data? The first question that comes to mind is: can we tell which reviews are positive and which are negative? Can we do some sentiment analysis on these reviews?

corpus = [
'Great course. Love the professor.',
'Great content. Textbook was great',
'This course has very hard…
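As a small illustration of the bigram part, here is a sketch that counts word pairs with the standard library only. It uses just the two reviews that appear in full above, under the name `reviews`, and the `ngrams` helper and its punctuation handling are assumptions for this example rather than the article's own code.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-word tuples in a piece of text."""
    # Lowercase each word and strip punctuation from its ends
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Only the two reviews that appear in full above
reviews = [
    "Great course. Love the professor.",
    "Great content. Textbook was great",
]

# Note: this simple version lets bigrams cross sentence boundaries
bigram_counts = Counter(bg for review in reviews for bg in ngrams(review, 2))
for bigram, count in bigram_counts.most_common():
    print(" ".join(bigram), count)
```

Calling `ngrams(review, 3)` gives trigrams in the same way. On a larger set of reviews, the most frequent pairs start to show what people talk about.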

Sophia Yang

Ph.D. | Senior Data Scientist @ Anaconda | All views are my own
