
PYTHON DATA SCIENCE HANDBOOK

This website contains the full text of the Python Data Science Handbook by Jake VanderPlas; the content is available on GitHub in the form of Jupyter notebooks. The text is released under the CC-BY-NC-ND license, and code is released under the MIT license.

IN DEPTH: GAUSSIAN MIXTURE MODELS

The k-means clustering model explored in the previous section is simple and relatively easy to understand, but its simplicity leads to practical challenges in its application. In particular, the non-probabilistic nature of k-means and its use of simple distance-from-cluster-center to assign cluster membership lead to poor performance in many real-world situations.

FEATURE ENGINEERING

CUSTOMIZING TICKS

This is an excerpt from the Python Data Science Handbook by Jake VanderPlas; Jupyter notebooks are available on GitHub. The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!

A WHIRLWIND TOUR OF PYTHON

A Whirlwind Tour of Python is a fast-paced introduction to essential features of the Python language, aimed at researchers and developers who are already familiar with programming in another language. The material is particularly designed for those who wish to use Python for data science and/or scientific programming, and in this capacity

COMBINING DATASETS: MERGE AND JOIN

Relational Algebra

The behavior implemented in pd.merge() is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases. The strength of the relational algebra approach is that it proposes several primitive operations, which become the building blocks of more

1. BASIC PLOTTING WITH PYLAB

Matplotlib Tutorial: 1. Basic Plot Interface (mpl-tutorial 0.1 documentation)

In this notebook, we will explore the basic plot interface using pylab.plot and pylab.scatter. We will also discuss the difference between the pylab interface, which offers plotting with the feel of Matlab.


DATA INDEXING AND SELECTION

A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing. The purpose of the ix indexer will become more apparent in the context of DataFrame objects, which we will discuss in a moment. One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc makes them very useful in


A WHIRLWIND TOUR OF PYTHON

This website contains the full text of my free O'Reilly report, A Whirlwind Tour of Python. A Whirlwind Tour of Python is a fast-paced introduction to essential features of the Python language, aimed at researchers and developers who are already familiar with programming in another language. The material is particularly designed for those who wish to use Python for data science and/or

INTRODUCING SCIKIT-LEARN

There are several Python libraries which provide solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation.

FANCY INDEXING

In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., arr[0]), slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]). In this section, we'll look at another style of array indexing, known as fancy indexing. Fancy indexing is like the simple indexing we've already seen, but we pass arrays of indices in place of single scalars.

IN-DEPTH: MANIFOLD LEARNING

In manifold learning, the globally optimal number of output dimensions is difficult to determine. In contrast, PCA lets you find the output dimension based on the explained variance. In manifold learning, the meaning of the embedded dimensions is not always clear. In PCA, the principal components have a very clear meaning.

HIERARCHICAL INDEXING

Seeing this, you might wonder why we would bother with hierarchical indexing at all. The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data; taking

HELP AND DOCUMENTATION IN IPYTHON

The Python language and its data science ecosystem are built with the user in mind, and one big part of that is access to documentation. Every Python object contains the reference to a string, known as a docstring, which in most cases will contain a concise summary of the object and how to use it. Python has a built-in help() function that can

OPERATING ON DATA IN PANDAS

One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).

HANDLING MISSING DATA

The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):

    import numpy as np
    import pandas as pd

Pythonic Perambulations


THE WAITING TIME PARADOX, OR, WHY IS MY BUS ALWAYS LATE?

Thu 13 September 2018

(Image source: Wikipedia, license CC-BY-SA 3.0)

If you, like me, frequently commute via public transit, you may be familiar with the following situation:

> _You arrive at the bus stop, ready to catch your bus: a line that advertises arrivals every 10 minutes. You glance at your watch and note the time... and when the bus finally comes 11 minutes later, you wonder why you always seem to be so unlucky._

Naïvely, you might expect that if buses are coming every 10 minutes and you arrive at a random time, your average wait would be something like 5 minutes. In reality, though, buses do not arrive exactly on schedule, and so you might wait longer. It turns out that under some reasonable assumptions, you can reach a startling conclusion:

When waiting for a bus that comes on average every 10 minutes, your average waiting time will be 10 minutes.

This is what is sometimes known as the _waiting time paradox_. I've encountered this idea before, and always wondered whether it is actually true... how well do those "reasonable assumptions" match reality? This post will explore the waiting time paradox from the standpoint of both simulation and probabilistic arguments, and then take a look at some real bus arrival time data from the city of Seattle to (hopefully) settle the paradox once and for all.
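The claim above can be checked with a rough simulation. The sketch below is not taken from the post: it assumes that "reasonable assumptions" means bus arrivals form a Poisson process (exponentially distributed gaps with a 10-minute mean) and that passengers show up uniformly at random in time. The names `mean_wait`, `tau`, and the sample sizes are illustrative choices.

```python
import bisect
import random
from itertools import accumulate


def mean_wait(gaps, n_passengers, rng):
    """Average wait for passengers who arrive uniformly at random in time."""
    arrivals = list(accumulate(gaps))  # cumulative bus arrival times
    total = 0.0
    for _ in range(n_passengers):
        t = rng.uniform(0, arrivals[-1])        # passenger reaches the stop at time t
        nxt = bisect.bisect_left(arrivals, t)   # first bus arriving at or after t
        total += arrivals[nxt] - t
    return total / n_passengers


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
tau = 10.0              # average minutes between buses
n = 100_000             # number of buses and of passengers (arbitrary)

# Buses exactly on schedule versus buses with exponential gaps (assumed model).
regular_gaps = [tau] * n
poisson_gaps = [rng.expovariate(1 / tau) for _ in range(n)]

regular_wait = mean_wait(regular_gaps, n, rng)
poisson_wait = mean_wait(poisson_gaps, n, rng)

print(f"on-schedule buses: mean wait is about {regular_wait:.1f} minutes")
print(f"Poisson arrivals:  mean wait is about {poisson_wait:.1f} minutes")
```

Under these assumptions the on-schedule case should land near 5 minutes and the Poisson case near 10, because a randomly arriving passenger is more likely to fall inside a long gap than a short one.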

simulation statistics

SIMULATING CHUTES & LADDERS IN PYTHON

Mon 18 December 2017

This weekend I found myself in a particularly drawn-out game of Chutes and Ladders with my four-year-old. If you've not had the pleasure of playing it, Chutes and Ladders (also sometimes known as Snakes and Ladders) is a classic kids board game wherein players roll a six-sided die to advance forward through 100 squares, using "ladders" to jump ahead, and avoiding "chutes" that send you backward. It's basically a glorified random walk with visual aids to help you build a narrative. Thrilling. But she's having fun practicing counting, learning to win and lose gracefully, and developing the requisite skills to be a passionate sports fan, so I play along.

On the approximately twenty-third game of the morning, as we found ourselves in a near endless cycle of climbing ladders and sliding down chutes, never quite reaching that final square to end the game, I started wondering how much longer the game could last: what is the expected length of a game? How heavy are the tails of the game length distribution? How succinctly could I answer those questions in Python? And then, at some point, it clicked: Chutes and Ladders is memoryless (the effect of a roll depends only on where you are, not where you've been), and so it can be modeled as a Markov process! By the time we (finally) hit square 100, I basically had this blog post written, at least in my head.

When I tweeted about this, people pointed me to a number of similar treatments of Chutes & Ladders, so I'm under no illusion that this idea is original. Think of this as a blog post version of a dad joke: my primary goal is not originality, but self-entertainment, and if anyone else finds it entertaining that's just an added bonus.
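Because the game is memoryless, the expected number of remaining turns from each square satisfies a simple recurrence, which is one way to make the Markov idea concrete. The sketch below is not the post's code: the `JUMPS` table is a hypothetical handful of chutes and ladders rather than the real board, and it assumes a roll that would overshoot the last square leaves the player in place.

```python
# Hypothetical chutes and ladders (start square -> end square); NOT the real board.
JUMPS = {4: 14, 9: 31, 28: 84, 47: 26, 87: 24, 95: 75}


def expected_turns(n_squares=100, jumps=None, tol=1e-10, max_sweeps=100_000):
    """Expected turns to reach the final square from every square.

    Repeatedly applies E[s] = 1 + (1/6) * sum over rolls of E[next square]
    until the values stop changing. Assumes an overshooting roll means
    the player stays put.
    """
    jumps = JUMPS if jumps is None else jumps
    expected = [0.0] * (n_squares + 1)  # expected[n_squares] stays 0: game over
    for _ in range(max_sweeps):
        delta = 0.0
        for square in range(n_squares - 1, -1, -1):
            total = 0.0
            for roll in range(1, 7):
                target = square + roll
                if target > n_squares:
                    target = square                      # overshoot: no move (assumed rule)
                else:
                    target = jumps.get(target, target)   # follow a chute or ladder
                total += expected[target] / 6
            new_value = 1 + total
            delta = max(delta, abs(new_value - expected[square]))
            expected[square] = new_value
        if delta < tol:
            break
    return expected


turns = expected_turns()
print(f"expected turns from the start on this made-up board: {turns[0]:.1f}")
```

A quick sanity check of the recurrence: on a one-square board with no jumps, only a roll of 1 finishes the game, so `expected_turns(1, {})[0]` should come out at 6.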

simulation animation

OPTIMIZATION OF SCIENTIFIC CODE WITH CYTHON: ISING MODEL

Mon 11 December 2017

Python is quick and easy to code, but can be slow when doing intensive numerical operations. Translating code to Cython can be helpful, but in most cases requires a bit of trial and error to achieve the optimal result. Cython's tutorials contain a lot of information, but for iterative workflows like optimization with Cython, it's often useful to see it done "live". For that reason, I decided to record some screencasts showing this iterative optimization process, using an Ising Model as an example application.

jupyter cython simulation

INSTALLING PYTHON PACKAGES FROM A JUPYTER NOTEBOOK

Tue 05 December 2017

In software, it's said that all abstractions are leaky, and this is as true for the Jupyter notebook as it is for any other software. I most often see this manifest itself with the following issue:

> I installed _package X_ and now I can't import it in the notebook. Help!

This issue is a perennial source of StackOverflow questions (e.g. this, that, here, there, another, this one, that one, and this... etc.).

Fundamentally the problem is usually rooted in the fact that the Jupyter kernels are disconnected from Jupyter's shell; in other words, the installer points to a different Python version than is being used in the notebook. In the simplest contexts this issue does not arise, but when it does, debugging the problem requires knowledge of the intricacies of the operating system, the intricacies of Python package installation, and the intricacies of Jupyter itself. In other words, the Jupyter notebook, like all abstractions, is leaky.

In the wake of several discussions on this topic with colleagues, some online (exhibit A, exhibit B) and some off, I decided to treat this issue in depth here. This post will address a couple things:

* First, I'll provide a quick, bare-bones answer to the general question, _how can I install a Python package so it works with my jupyter notebook, using pip and/or conda?_

* Second, I'll dive into some of the background of exactly _what_ the Jupyter notebook abstraction is doing, how it interacts with the complexities of the operating system, and how you can think about where the "leaks" are, and thus better understand what's happening when things stop working.

* Third, I'll talk about some ideas the community might consider to help smooth over these issues, including some changes that the Jupyter, Pip, and Conda developers might consider to ease the cognitive load on users.

This post will focus on two approaches to installing Python packages: pip and conda. Other package managers exist (including platform-specific tools like yum, apt, homebrew, etc., as well as cross-platform tools like enstaller), but I'm less familiar with them and won't be remarking on them further.

jupyter conda pip
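One common way to sidestep the mismatch described in this excerpt, for pip at least, is to invoke pip through the interpreter that is actually running the code. The sketch below illustrates that idea and is not quoted from the post; the helper names `pip_install_command` and `install` are made up for this example.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command tied to the interpreter running this code.

    sys.executable is the path of the current Python, so "-m pip" installs
    into the same environment that the running kernel imports from.
    """
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))


# Show the command without running it (nothing is installed here).
print(" ".join(pip_install_command("numpy")))
```

Calling `install("numpy")` from a notebook cell would then target the kernel's own environment rather than whichever `pip` happens to be first on the shell's path. This covers pip only; conda environments need their own handling.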

EXPLORING LINE LENGTHS IN PYTHON PACKAGES

Thu 09 November 2017

This week, Twitter upped their single-tweet character limit from 140 to 280, purportedly based on this interesting analysis of tweet lengths published on Twitter's engineering blog. The gist of the analysis is this: English language tweets display a roughly log-normal distribution of character counts, except near the 140-character limit, at which the distribution spikes. The analysis takes this as evidence that Twitter users often "cram" their longer thoughts into the 140-character limit, and suggests that a 280-character limit would more naturally accommodate the distribution of people's desired tweet lengths.

This immediately brought to mind another character limit that many Python programmers face in their day-to-day lives: the 79-character line limit suggested by Python's PEP8 style guide:

> Limit all lines to a maximum of 79 characters.

I began to wonder whether popular Python packages (e.g. NumPy, SciPy, Pandas, Scikit-Learn, Matplotlib, AstroPy) display anything similar to what is seen in the distribution of tweet lengths. Spoiler alert: they do! And the details of the distribution reveal some insights into the programming habits and stylistic conventions of the communities who write them.

python data statistics
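The raw data behind such a comparison is just a tally of line lengths across a package's source files. The following is a minimal sketch of how that tally might be collected, not the post's own code; the function names are illustrative, and the standard-library `json` package is used as the example only because it is always installed.

```python
from collections import Counter
from pathlib import Path


def line_length_counts(root):
    """Count how many lines of each length appear in .py files under root."""
    counts = Counter()
    for path in Path(root).rglob("*.py"):
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                counts[len(line.rstrip("\r\n"))] += 1
    return counts


def share_over(counts, limit=79):
    """Fraction of lines longer than limit characters."""
    total = sum(counts.values())
    over = sum(n for length, n in counts.items() if length > limit)
    return over / total if total else 0.0


if __name__ == "__main__":
    import json  # any installed package directory would do

    counts = line_length_counts(Path(json.__file__).parent)
    print(f"lines over 79 characters: {share_over(counts):.1%}")
```

Plotting `counts` as a histogram for several packages would be the natural next step for looking for a pile-up just below the 79-character mark.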

EXPOSING PYTHON 3.6'S PRIVATE DICT VERSION

Fri 26 May 2017

I just got home from my sixth PyCon, and it was wonderful as usual. If you weren't able to attend (or even if you were) you'll find a wealth of entertaining and informative talks on the PyCon 2017 YouTube channel.

Two of my favorites this year were a complementary pair of talks on Python dictionaries by two PyCon regulars: Raymond Hettinger's Modern Python Dictionaries: A confluence of a dozen great ideas and Brandon Rhodes' The Dictionary Even Mightier (a followup of his PyCon 2010 talk, The Mighty Dictionary).

Raymond's is a fascinating dive into the guts of the CPython dict implementation, while Brandon's focuses more on recent improvements in the dict's user-facing API. One piece both mention is the addition in Python 3.6 of a private dictionary version to aid CPython optimization efforts. In Brandon's words:

> "PEP509 added a private version number... every dictionary has a version number, and elsewhere in memory a master version counter. And when you go and change a dictionary the master counter is incremented from a million to a million and one, and that value a million and one is written into the version number of that dictionary. So what this means is that you can come back later and know if it's been modified, without reading maybe its hundreds of keys and values: you just look and see if the version has increased since the last time you were there."

He later went on to say,

> "[...] is internal; I haven't seen an interface for users to get to it..."

which, of course, I saw as an implicit challenge. So let's expose it!

python tutorial ctypes

A PRACTICAL GUIDE TO THE LOMB-SCARGLE PERIODOGRAM

Thu 30 March 2017

This week I published the preprint of a manuscript that started as a blog post, but quickly out-grew this medium: Understanding the Lomb-Scargle Periodogram.

Figure 24 from Understanding the Lomb-Scargle Periodogram. The figure shows the true period vs the periodogram peak for a simulated dataset with an observing cadence typical of ground-based optical astronomy. The simulation reveals common patterns of failure of the Lomb-Scargle method that are not often discussed explicitly, but are straightforward to explain based on the intuition developed in the paper; see Section 7.2 for a detailed discussion.

Over the last couple years I've written a number of Python implementations of the Lomb-Scargle periodogram (I'd recommend AstroPy's LombScargle in most cases today), and also wrote a marginally popular blog post and somewhat pedagogical paper on the subject. This all has led to a steady trickle of emails from students and researchers asking for advice on applying and interpreting the Lomb-Scargle algorithm, particularly for astronomical data. I noticed that these queries tended to repeat many of the same questions and express some similar misconceptions, and this paper is my attempt to address those once and for all, in a "mere" 55 pages (which includes 26 figures and 4 full pages of references, so it's not all that bad).

lomb-scargle

GROUP-BY FROM SCRATCH

Wed 22 March 2017

I've found one of the best ways to grow in my scientific coding is to spend time comparing the efficiency of various approaches to implementing particular algorithms that I find useful, in order to build an intuition of the performance of the building blocks of the scientific Python ecosystem. In this vein, today I want to take a look at an operation that is in many ways fundamental to data-driven exploration: the group-by, otherwise known as the split-apply-combine pattern. An archetypal example of a summation group-by is shown in a figure borrowed from the Aggregation and Grouping section of the Python Data Science Handbook.

The basic idea is to split the data into groups based on some value, apply a particular operation to the subset of data within each group (often an aggregation), and then combine the results into an output dataframe. Python users generally turn to the Pandas library for this type of operation, where it is implemented efficiently via a concise object-oriented API.

pandas python benchmarks
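To make the split-apply-combine idea in this excerpt concrete without relying on any library, here is a minimal pure-Python sketch of a summation group-by. It is an illustration only, not the Pandas API and not necessarily one of the implementations the post benchmarks; `groupby_sum` and the sample data are made up for this example.

```python
from collections import defaultdict


def groupby_sum(keys, values):
    """Sum the values that share a key (split, apply, combine in one pass)."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value  # "split" by key and "apply" the sum incrementally
    return dict(totals)       # "combine" into one result, in first-seen key order


keys = ["A", "B", "C", "A", "B", "C"]
values = [1, 2, 3, 4, 5, 6]

print(groupby_sum(keys, values))  # {'A': 5, 'B': 7, 'C': 9}
```

A dictionary of running totals keeps this to a single pass over the data; other aggregations (mean, max, count) would need a different accumulator but follow the same shape.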

TRIPLE PENDULUM CHAOS!

Wed 08 March 2017

Earlier this week a tweet made the rounds which features a video that nicely demonstrates chaotic dynamical systems in action.

_Edit: a reader pointed out that the original creator of this animation posted it on reddit in 2016._

Naturally, I immediately wondered whether I could reproduce this simulation in Python. This post is the result.

matplotlib animation simulation

REPRODUCIBLE DATA ANALYSIS IN JUPYTER

Fri 03 March 2017

Jupyter notebooks provide a useful environment for interactive exploration of data. A common question I get, though, is how you can progress from this nonlinear, interactive, trial-and-error style of exploration to a more linear and reproducible analysis based on organized, packaged, and tested code. This series of videos presents a case study in how I personally approach reproducible data analysis within the Jupyter notebook.

Each video is approximately 5-8 minutes; the videos are available in a YouTube Playlist. Alternatively, below you can find the videos with some description and links to relevant resources.

data pandas jupyter
