WES MCKINNEY

I published the first edition in 2012, and the 2nd edition was published in 2017.

PYTHON FOR DATA ANALYSIS BOOK

The 2nd Edition of my book was released digitally on September 25, 2017, with print copies shipping a few weeks later. The 1st Edition was published in October, 2012.

STREAMING COLUMNAR DATA WITH APACHE ARROW

Over the past couple weeks, Nong Li and I added a streaming binary format to Apache Arrow, accompanying the existing random access / IPC file format. We have implementations in Java and C++, plus Python bindings. In this post, I explain how the format works and show how you can achieve very high data throughput to pandas DataFrames.

NATIVE HADOOP FILE SYSTEM (HDFS) CONNECTIVITY IN PYTHON

There have been many Python libraries developed for interacting with the Hadoop File System, HDFS, via its WebHDFS gateway as well as its native Protocol Buffers-based RPC interface. I'll give you an overview of what's out there and show some engineering I've been doing to offer a high performance HDFS interface within the developing Arrow ecosystem. This blog is a follow up to my

FROM ARROW TO PANDAS AT 10 GIGABYTES PER SECOND

To go back to pandas-land, call the table's to_pandas method. This supports a multithreaded conversion, so let's do a single-threaded conversion for comparison:

    >>> %timeit df2 = table.to_pandas(nthreads=1)
    10 loops, best of 3: 158 ms per loop

This is 6.33 GB/s, or about 20% slower than purely memcpy-based construction.
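As a rough check on the arithmetic behind the throughput figure above, the sketch below converts a timing into GB/s. The 1 GB payload is an assumption inferred from the quoted numbers (1 GB in 0.158 s is about 6.33 GB/s); the excerpt itself does not state the table size, and `throughput_gb_per_s` is a hypothetical helper name.

```python
def throughput_gb_per_s(payload_gb: float, elapsed_seconds: float) -> float:
    """Return throughput in GB/s for a payload processed in elapsed_seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return payload_gb / elapsed_seconds


# Assumption: the converted table holds 1 GB of data (not stated in the excerpt).
single_threaded = throughput_gb_per_s(1.0, 0.158)  # 158 ms per loop
print(f"{single_threaded:.2f} GB/s")
```

The same helper can be reused for any other timing, as long as the payload size is known.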

SPYING ON INSTANCE METHODS WITH PYTHON'S MOCK MODULE

Python's mock module (unittest.mock in Python 3.3 and higher) allows you to observe parameters passed to functions. I'm a little slow, so I had to dig around to figure out how to do this. Let's say you have a class:

Now, let's suppose you are testing the functionality of ProductionClass

DO AVERAGE CONSUMERS STILL NEED DROPBOX?

TL;DR: At the risk of stating the obvious, manual management of files on disks now in 2016 is increasingly old-fashioned and largely unnecessary, especially among the non-technorati. Encapsulated / managed cloud services and consumer web applications have made it anachronistic for most normal people. Whether this is a good thing can be debated, but it is happening nonetheless.

WHY PANDAS USERS SHOULD BE EXCITED ABOUT APACHE ARROW

I'm super excited to be involved in the new open source Apache Arrow community initiative. For Python (and R, too!), it will help enable:

* Closer to native performance Python extensions for big data systems like Apache Spark
* New in-memory analytics functionality for nested / JSON-like data


ABOUT - WES MCKINNEY

Biography: Wes McKinney is an open source software developer focusing on data analysis tools. He created the Python pandas project and is a co-creator of Apache Arrow, his current development focus. He authored 2 editions of the reference book "Python for Data Analysis". Wes is a Member of The Apache Software Foundation and also a PMC member

PRESENTATIONS

Upcoming

2020-02-26: ScaledML (Mountain View, CA)
2020-05-08: NYC R Conference (New York, NY)
2020-05-26: Ray Summit (San Francisco, CA)

Past

2020-02-08: PyCon Colombia (Medellín, CO) Slides
2019-10-22: OmniSci Converge (Mountain View, CA)
2019-05-10: NYC R Conference (New York, New York) Apache Arrow: Leveling Up the Data Science Stack Slides
2019-01-11


AVOID UNSIGNED INTEGERS IN C++ IF YOU CAN

Unsigned integers (size_t, uint32_t, and friends) can be hazardous, as signed-to-unsigned integer conversions can happen without so much as a compiler warning.

An example: size_t as an index variable

Occasionally, discussions come up about using unsigned integers as index variables for STL containers (whose size() attribute is unsigned). So the debate is between effectively these two

EVEN EASIER FREQUENCY TABLES IN PANDAS 0.7.0

I put in a little work on a new crosstab function in the main pandas namespace. It's basically a convenient shortcut to calling pivot_table to make it easy to compute cross-tabulations for a set of factors using pandas DataFrame or even vanilla NumPy arrays! Here's an example:

LEAVING NYC FOR NASHVILLE

For ten out of the last eleven years, I've lived in two places: New York City and San Francisco. The last two years have been in NYC. After founding Ursa Labs, a not-for-profit open source development group, I felt it was time to make my home somewhere that isn't either of those places. After some contemplation and consulting many friends, I decided on Nashville, Tennessee.

SPEEDING UP PANDAS'S FILE PARSERS WITH CYTHON

For multi-dimensional arrays, you specify the number of dimensions in the buffer and pass multiple indexes (Py_ssize_t is the proper C "index type" to use). I'll demonstrate this in: Converting rows to columns faster than zip(*rows)

FEATHER FORMAT UPDATE: WHENCE AND WHITHER?

Earlier this year, development for the Feather file format moved to the Apache Arrow codebase. I will explain how this has already affected Feather and what to


ARCHIVES - WES MCKINNEY

Fri 22 March 2013: I'm moving to San Francisco. And hiring.
Mon 11 February 2013: Whirlwind tour of pandas in 10 minutes.
Mon 10 December 2012: Update on upcoming pandas v0.10, new file parser, other performance wins.
Thu 04 October 2012: A new high performance, memory-efficient file parser engine for pandas.

APACHE ARROW AND THE "10 THINGS I HATE ABOUT PANDAS"

This post is the first of many to come on Apache Arrow, pandas, pandas2, and the general trajectory of my work in recent times and into the foreseeable future. This is a bit of a read and overall fairly technical, but if interested I encourage you to take the time

THE ULTRARICH'S DIRTY SECRET: NOT PAYING TAXES

The tax situation in the United States is pretty messed up. Much ado has been made over the last week about AOC suggesting bringing back the pre-Reagan era 70% marginal tax rate on regular income over $10,000,000. This means that if you made $12 million in normal income in a single year, you would pay 70% of the last $2 million to the federal government.
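The marginal-rate arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of the single top bracket described in the text (70% on income over $10,000,000). The function name `top_bracket_tax` is hypothetical, and tax owed on the first $10 million is deliberately left out because the excerpt does not give those rates.

```python
def top_bracket_tax(income: int, threshold: int = 10_000_000, rate_percent: int = 70) -> int:
    """Tax owed (in whole dollars) on the portion of income above the threshold only."""
    taxable_above_threshold = max(income - threshold, 0)
    # Integer arithmetic avoids floating-point rounding on dollar amounts.
    return taxable_above_threshold * rate_percent // 100


owed = top_bracket_tax(12_000_000)  # 70% of the last $2 million
print(f"${owed:,}")
```

Income at or below the threshold yields zero from this function, which is the point of a marginal rate: only the dollars above the threshold are taxed at 70%.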

FILTERING OUT DUPLICATE PANDAS.DATAFRAME ROWS

Sean Taylor recently alerted me to the fact that there wasn't an easy way to filter out duplicate rows in a pandas DataFrame. R has the duplicated function which serves this purpose quite nicely. The R method's implementation is kind of kludgy in my opinion (from "The data frame method works by pasting together a character representation of the rows"), but in any case I set about writing a

FROM ARROW TO PANDAS AT 10 GIGABYTES PER SECOND

At 9.71 GB/s, this is not far from saturating the main memory bandwidth on my consumer desktop hardware (but I am not an expert on this).

The performance benefits of multithreading can be more dramatic on other hardware. While the performance ratio on my desktop is only 1.53, on my (also quad-core) laptop it is 3.29.

Note that numeric data is a best-case scenario; string or binary data

SPYING ON INSTANCE METHODS WITH PYTHON'S MOCK MODULE

Now, let's suppose you are testing the functionality of ProductionClass, but you want to observe the parameters passed to your internal methods but still invoke those internal methods. I didn't find a lot of examples of this from my Google searches, so here is the solution using unittest.mock (or mock from PyPI if you're on Legacy Python 2.x):
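The code from the original post is not preserved in this excerpt. The sketch below is one possible way to spy on an internal method while still invoking it, using the `wraps` argument of `unittest.mock`. The body of `ProductionClass` and its method names `run` and `_double` are hypothetical stand-ins, not the class from the post.

```python
from unittest import mock


class ProductionClass:
    """Hypothetical stand-in for the class under test (not the post's original)."""

    def run(self, x):
        # Public entry point that delegates to an internal method.
        return self._double(x) + 1

    def _double(self, x):
        return 2 * x


obj = ProductionClass()

# wraps= makes the mock record each call and then forward it to the real
# bound method, so the internal method is still invoked.
with mock.patch.object(obj, "_double", wraps=obj._double) as spy:
    result = obj.run(5)

spy.assert_called_once_with(5)
print(result)
```

Patching the attribute on the instance rather than on the class keeps the spy local to one object, and the original method is restored when the `with` block exits.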


ABOUT - WES MCKINNEY

statsmodels (website, code). I worked on time series models (e.g. VAR) and pandas integration.

Long form biography

I'm an American computer programmer and the Director of Ursa Labs. I studied theoretical mathematics at MIT (graduating in late 2006) before becoming very interested in programming and tools for data analysis, especially for industry use cases, in 2007.


EVEN EASIER FREQUENCY TABLES IN PANDAS 0.7.0

This makes it very easy to produce an easy-on-the-eyes frequency table. crosstab can also take NumPy arrays. Suppose we had 1 million draws from a normal distribution, and we wish to produce a histogram-like table showing the number of draws whose absolute values fall into the bins defined by [...]. Also, let's divide things up by sign.
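A minimal sketch of the kind of table described above, passing plain NumPy arrays to `pandas.crosstab`. The bin edges are hypothetical because the post's own edges are missing from this excerpt, and the seed and label strings are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
draws = rng.standard_normal(1_000_000)

# Hypothetical bin edges; the original post's edges are not preserved here.
edges = np.array([1.0, 2.0, 3.0])
labels = np.array(["[0, 1)", "[1, 2)", "[2, 3)", "[3, inf)"])

# np.digitize returns 0 for |x| < 1, 1 for 1 <= |x| < 2, and so on.
magnitude_bin = labels[np.digitize(np.abs(draws), edges)]
sign = np.where(draws < 0, "negative", "non-negative")

table = pd.crosstab(magnitude_bin, sign, rownames=["abs value"], colnames=["sign"])
print(table)
```

Each cell counts the draws that fall in one magnitude bin with one sign, so the cells sum to the number of draws.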


UPDATE ON UPCOMING PANDAS V0.10, NEW FILE PARSER, OTHER PERFORMANCE WINS

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10 you can specify a subset of columns right in read_csv which results in both much faster parsing time and lower memory usage (since we're throwing away the data from the other columns after tokenizing the file):

A ROADMAP FOR RICH SCIENTIFIC DATA STRUCTURES IN PYTHON

Discussion thread on Hacker News

So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and

DON'T SELL ON AMAZON

Selling your stuff on Amazon is a losing game, and I don't recommend that you do it. Let me tell you what happened to me recently. I wasn't using my iPad so I decided to sell it on Amazon.

COMPILING DATAFRAME CODE IS HARDER THAN IT LOOKS

Many people have asked me about the proliferation of DataFrame APIs like Spark DataFrames, Ibis, Blaze, and others. As it turns out, executing pandas-like code in a scalable environment is a difficult compiler engineering problem to enable composable, imperative Python or R code to be translated into a SQL or Spark/MapReduce-like representation.
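The pandas v0.10 excerpt earlier in this group mentions selecting a subset of columns directly in read_csv, but its example code is not preserved. In current pandas this is done with the `usecols` parameter. The sketch below assumes a tiny in-memory CSV with hypothetical column names in place of the data set from the post.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the wide data set mentioned in the post.
csv_text = """id,name,amount,notes
1,alice,10.5,first
2,bob,7.25,second
"""

# usecols keeps only the named columns while parsing.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "amount"])
print(list(subset.columns))
```

Only the requested columns end up in the resulting DataFrame, which is where the memory saving described in the excerpt comes from.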


BLOG

I write about technical topics (mostly related to Python programming), with occasional diversions into other things that interest me.

ABOUT ME

Learn more about my present and past life and the projects I've worked on.

PRESENTATIONS

I give frequent presentations and tutorials about my work at Python and industry conferences and user meetups.

CODE

Most of my open source software is hosted on GitHub.

© 2019 Wes McKinney
