Data scientists use data visualization to communicate findings and generate insights. It’s essential for data scientists to know how to create meaningful visualization dashboards, especially real-time dashboards. This article covers two ways to build a real-time dashboard in Python:
For demonstration purposes, the plots and dashboards are very basic, but you will get the idea of how we do a real-time dashboard.
Multiclass logistic regression is also called multinomial logistic regression or softmax regression. It is used when we want to predict more than two classes. Many people use multiclass logistic regression all the time without really knowing how it works. So, I am going to walk you through how the math works and implement it from scratch in Python using gradient descent.
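The core of the from-scratch implementation can be sketched with NumPy: a numerically stable softmax plus gradient descent on the cross-entropy loss. The toy data, learning rate, and iteration count below are just illustrative choices, not the article's actual setup.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy data: 4 samples, 2 features, 3 classes (made up for illustration)
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 2, 0])
Y = np.eye(3)[y]               # one-hot labels, shape (4, 3)

W = np.zeros((2, 3))           # weights, shape (n_features, n_classes)
b = np.zeros(3)                # biases, one per class
lr = 0.5

for _ in range(500):
    P = softmax(X @ W + b)            # predicted class probabilities
    grad_W = X.T @ (P - Y) / len(X)   # gradient of mean cross-entropy loss
    grad_b = (P - Y).mean(axis=0)
    W -= lr * grad_W
    b -= lr * grad_b

preds = softmax(X @ W + b).argmax(axis=1)
```

The `P - Y` term is the nice part of softmax regression: the gradient of the cross-entropy loss with respect to the logits is simply the predicted probabilities minus the one-hot labels.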
Disclaimer: there are various notations on this topic. I am using the notation that I think is easy to understand and visualize. …
Software testing is essential for software development. Software engineers are encouraged to use test-driven development (TDD), a software development process in which test cases are written before the software itself. For data scientists, writing tests first is not always easy or practical. Nevertheless, software testing is too important to skip: every data scientist should know how to write unit tests and use them in their data science workflow. Many data scientists already use assertions, which is a very important first step toward test-driven development. This article will step up from assertions and focus…
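The step up from inline assertions is putting them inside named test functions that a runner like pytest can discover. Here is a minimal sketch with a hypothetical data-cleaning helper; both the function and the file layout are examples, not the article's actual code.

```python
# A hypothetical data-cleaning helper and a unit test for it.
# A test like this would live in a file such as test_cleaning.py
# and run with `pytest`.
import math

def fill_missing_with_mean(values):
    """Replace None/NaN entries with the mean of the non-missing values."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    mean = sum(present) / len(present)
    return [mean if (v is None or math.isnan(v)) else v for v in values]

def test_fill_missing_with_mean():
    assert fill_missing_with_mean([1.0, None, 3.0]) == [1.0, 2.0, 3.0]
    assert fill_missing_with_mean([2.0, float("nan")]) == [2.0, 2.0]
```

The difference from a bare assertion in a notebook cell is that the test is repeatable: it runs on every change, not just the one time you happened to execute the cell.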
Many data scientists like to use Jupyter Notebook or JupyterLab to do their data explorations, visualizations, and model building. I know some data scientists refuse to use Jupyter Notebook. But, I love to use Jupyter Notebook/Lab to do my experiments and explorations. Here is a Jupyter Notebook workflow that might be helpful.
Whenever you are working on a new project, you should start with a fresh environment. If you are lazy and don’t want to create a new environment for every project, you should at least create one new environment that is separate from your base environment…
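The mention of a base environment suggests conda, where this is `conda create -n myproject` followed by `conda activate myproject`. As a sketch that runs anywhere, here is the equivalent with the standard-library `venv` module (the project path is just an example):

```shell
# A fresh per-project environment; conda users would run
# `conda create -n myproject` / `conda activate myproject` instead.
mkdir -p /tmp/demo-project && cd /tmp/demo-project
python3 -m venv .venv            # create the environment inside the project
. .venv/bin/activate             # activate it for this shell session
python -m pip --version          # pip now points inside .venv
```

Either way, the point is the same: packages installed for one project can no longer break another.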
Profiling helps us identify bottlenecks and optimize performance in our code. If our code runs slowly, we can identify where our code is slow and then make corresponding improvements.
Here are the explanations of Python profilers from the Python documentation:
profile provides deterministic profiling of Python programs. A profile is a set of statistics that describes how often and for how long various parts of the program executed…
cProfile is recommended for most users; it’s a C extension with reasonable overhead that makes it suitable for profiling long-running programs.
cProfile is a Python built-in profiler, which is great for…
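A minimal cProfile session looks like the sketch below: wrap the code you want to measure, then use `pstats` to sort and print the hot spots. The `slow_sum` function is a made-up stand-in for your own slow code.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately wasteful: rebuilds a range and sums it on every iteration
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(10_000)
profiler.disable()

# Print the 5 most expensive entries, sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report shows, per function, how many times it was called and how much time it took, which is usually enough to point you straight at the bottleneck.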
If you are working by yourself, then git clone, git status, git add, git commit, and git push would probably be sufficient for your work. However, if you ever work on a team project with other data scientists and software engineers, it is better to use forks and branches. Here is the git workflow for you if you are on a team project:
The first step is to git clone your team’s project to your local machine, and then get in the project folder:
git clone https://github.com/your-team/team-project.git
cd team-project
It’s better if you commit your changes to a local branch and…
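The local-branch cycle can be sketched as follows. The setup lines create a throwaway repo so the commands run anywhere; in your real clone you would skip them, and the branch name, file, and commit message are just examples.

```shell
# Throwaway repo so the sketch is self-contained; skip this in a real clone.
mkdir -p /tmp/team-project-demo && cd /tmp/team-project-demo
git init -q .
git config user.email "you@example.com" && git config user.name "you"

git checkout -b my-feature            # create and switch to a local branch
echo "print('hello')" > report.py     # ...edit files...
git status                            # review what changed
git add report.py                     # stage the change
git commit -q -m "Add report script"  # commit to the local branch
# git push -u origin my-feature       # then push and open a pull request
```

Working on a branch keeps your teammates' main branch clean until your change has been reviewed and merged.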
Salesforce is probably the most annoying database I have worked with. This web page illustrates the relationships among the objects (i.e., data tables) stored in Salesforce. You might think it doesn’t look that bad. Well, in reality, the relationships among your tables can be a lot messier than this illustration. Let’s work through this and see how we can query Salesforce data while remaining sane.
Where can we find what tables and variables are available in Salesforce? To get an overview of all the tables (objects) and variables (entities), we can go to Salesforce `Developer Console — File — Open…
Imagine we have a large number of text files and we need to classify these text files into different topics. What should we do? This article will walk you through an overview of text classification and how I would approach this problem at a high level. I would like to address this problem in three steps: data preparation and exploration, labeling, and modeling.
The first step is data preparation and exploration. I will transform our text data into a matrix representation through different word embedding methods. …
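The simplest matrix representation is a bag-of-words count matrix: one row per document, one column per vocabulary word. The mini-corpus below is made up for illustration; in practice you would reach for a library such as scikit-learn's `CountVectorizer` or `TfidfVectorizer`.

```python
# A minimal bag-of-words sketch (corpus is hypothetical)
docs = ["the cat sat", "the dog sat", "the cat ran"]

# Vocabulary: every distinct word across the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# Count matrix: rows are documents, columns are vocabulary words
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]
```

Once the text is a matrix, every standard exploration and modeling tool (distance metrics, clustering, classifiers) becomes available.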
Stripe is an online payment company that offers software and APIs for processing payments and business management. I love that Stripe has different APIs for different languages, which makes people’s lives a lot easier.
I primarily use the Stripe Python API. To install:
pip install --upgrade stripe
You can also install it with conda:
conda install stripe
However, the most recent version of the library doesn’t seem to be available on Conda yet. The version I am using is Stripe 2.55.0.
Next, you will need an API key to access the Stripe API. Go to stripe.com …
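Once you have a key, a common pattern is to keep it out of your code and read it from an environment variable. The variable name and the placeholder value below are examples, not anything Stripe mandates; the commented lines show where the key would be handed to the library.

```python
import os

# Assumes you exported STRIPE_API_KEY in your shell; the fallback is a
# placeholder shaped like Stripe's test-mode secret keys ("sk_test_...").
api_key = os.environ.get("STRIPE_API_KEY", "sk_test_your_key_here")

# With the stripe package installed, you would then set:
# import stripe
# stripe.api_key = api_key
```

Reading the key from the environment keeps secrets out of version control and lets you swap test-mode and live-mode keys without touching the code.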
This article talks about the most basic text analysis tools in Python. We are not going into the fancy NLP models. Just the basics. Sometimes all you need is the basics :)
Let’s first get some text data. Here we have a list of course reviews that I made up. What can we do with this data? The first question that comes to mind is: can we tell which reviews are positive and which are negative? Can we do some sentiment analysis on these reviews?
corpus = [
'Great course. Love the professor.',
'Great content. Textbook was great',
'This course has very hard…
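A first, very basic pass at sentiment on reviews like these is a lexicon approach: count positive and negative words. The word lists below are made up for illustration; a real analysis would use an established lexicon or a library such as NLTK's VADER.

```python
# A toy lexicon-based sentiment score (word lists are invented examples)
positive = {"great", "love", "good", "excellent"}
negative = {"hard", "bad", "boring", "terrible"}

def sentiment(review):
    words = review.lower().replace(".", " ").replace(",", " ").split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

sentiment("Great course. Love the professor.")
```

It is crude, but it gives a baseline to compare any fancier model against.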
Ph.D. | Senior Data Scientist @ Anaconda | All views are my own